Abstract



Inspired by the ideas from the field of stochastic approximation, we propose a ran- 
domized algorithm to compute the capacity of a finite-state channel with a Markovian 
input. When the mutual information rate of the channel is concave with respect to 
the chosen parameterization, we show that the proposed algorithm will almost surely 
converge to the capacity of the channel and derive the rate of convergence. We also 
discuss the convergence behavior of the algorithm without the concavity assumption. 

1 Introduction

Discrete-time finite-state channels are a broad class of channels that have attracted considerable
interest in information theory. Prominent examples include partial response channels [HI ED],
Gilbert-Elliott channels [3T1 [T5] and noisy input-restricted channels [55], which are widely used
in real-life applications such as magnetic and optical recording [36] and communications over
band-limited channels with inter-symbol interference [TT| . The computation of the capacity of a
finite-state channel is notoriously difficult and has been open for decades. For a discrete
memoryless channel with a discrete memoryless source at its input, the classical Blahut-Arimoto
algorithm (BAA) [21 US] can effectively compute the channel capacity. For almost all nontrivial
finite-state channels, however, little is known about the channel capacity other than some bounds
(see, e.g., [55], [16], [5] and references therein), which are numerically computed using Monte
Carlo approaches. The methods in these works are believed to produce fairly precise numerical
approximations of the capacity of certain classes of finite-state channels; however, there are no
general proofs to justify such beliefs.

Recently, Vontobel et al. have proposed a generalized Blahut-Arimoto algorithm (GBAA) [53] to
maximize the mutual information rate of a finite-state machine channel with a finite-state
machine source at its input. This algorithm has attracted a great deal of attention because it
has been observed to approximate the channel capacity fairly precisely for a number of practical
channels. (Notably, some results that were derived in the context of the GBAA have proven to be
useful for analyzing the Bethe entropy function of some graphical models that appear in the
context of low-density parity-check codes [51] and for approximately computing the permanent of a
non-negative matrix [52].) For a finite-state channel, let X denote the input Markov process and
Y its corresponding output process, which, by definition, is a hidden Markov process [13]. In
contrast to the BAA, the convergence of the GBAA depends on the extra assumption that I(X; Y) and
H(X|Y) are both concave with respect to a chosen parameterization, which has been posed as
Conjecture 74 in [53]. Example 9.4, however, shows that the concavity conjecture is not true in
general; for other examples showing that I(X; Y) and H(X|Y) fail to be concave, see [32].

One of the hurdles encountered in computing the finite-state channel capacity is the 
problem of optimizing H(Y), which naturally occurs in the formula of the capacity of a broad 
class of finite-state channels. More specifically, there has long been a lack of understanding 
on the following two issues: 

(I) How to effectively compute the entropy rate of hidden Markov processes?

(II) How does the entropy rate of hidden Markov processes vary as the underlying Markov 
processes and the channels vary? 

As elaborated below, these two issues have recently been partially addressed by the
information theory community.

Related work on (I). It is well known that H(X) has a simple analytic formula. In stark
contrast, no simple and explicit formula for H(Y) has been found for most non-degenerate
channels since hidden Markov processes (or, more precisely, hidden Markov models) were
formulated more than half a century ago. Here, we remark that Blackwell [11] showed that H(Y)
can be written as an integral of an explicit function on a simplex with respect to the
Blackwell measure. However, the Blackwell measure seems to be too complicated for effective
computation of H(Y). Since 2000, there has been a rebirth of interest in computing and
estimating H(Y) in a variety of scenarios: the Blackwell measure has been used to bound H(Y)
[39], a variation on the classical Birch bounds [10] can be found in [16], and a new numerical
approximation of H(Y) has been proposed in [35]. Generalizing Blackwell's idea, an integral
formula for the derivatives of H(Y) has been derived in [14].

The celebrated Shannon-McMillan-Breiman theorem states that the n-th order sample entropy
$-\log p(Y_1^n)/n$ converges to H(Y) almost surely. Based on this, efficient Monte Carlo
methods for approximating H(Y) were proposed independently by Arnold and Loeliger [I],
Pfister, Soriaga and Siegel [12], and Sharma and Singh [4T]. However, a more quantitative
description of the convergence behavior of the proposed methods, such as the rate of
convergence and asymptotic normality, is lacking in these works. Recently, a central limit
theorem (CLT) [13] for the sample entropy has been derived as a corollary of a CLT for the top
Lyapunov exponent of a product of random matrices; a functional CLT has also been established
in [28]. To some extent, these two CLTs suggest that the Monte Carlo methods are "accurate" in
terms of approximating H(Y). There is also related work in different contexts from outside the
information theory community, such as [501 EZl ES].
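To make the Monte Carlo idea above concrete, the following is a minimal Python sketch (not the implementation of any of the cited works) that simulates one output path of a hidden Markov process and evaluates the sample entropy $-\log p(Y_1^n)/n$ with a normalized forward recursion. The function names `draw` and `sample_entropy`, and the representation of the model by a transition matrix `P`, an emission matrix `B` and an initial distribution `pi`, are assumptions made for illustration only.

```python
import math
import random


def draw(dist, rng):
    """Sample an index from a discrete distribution given as a list of probabilities."""
    u, acc = rng.random(), 0.0
    for idx, prob in enumerate(dist):
        acc += prob
        if u < acc:
            return idx
    return len(dist) - 1  # guard against floating-point round-off


def sample_entropy(P, B, pi, n, rng):
    """Return -log p(y_1^n)/n (in nats) for one simulated output path of length n.

    P[s][s2] : transition probabilities of the hidden Markov chain
    B[s][y]  : probability of emitting y from hidden state s
    pi[s]    : initial (ideally stationary) distribution of the chain
    """
    num_states = len(P)
    state = draw(pi, rng)
    alpha = list(pi)          # predictive distribution of the current hidden state
    log_prob = 0.0
    for _ in range(n):
        y = draw(B[state], rng)
        # Forward step: weight the predictive distribution by the emission likelihood.
        joint = [alpha[s] * B[s][y] for s in range(num_states)]
        c = sum(joint)        # c equals p(y_t | y_1^{t-1})
        log_prob += math.log(c)
        filt = [v / c for v in joint]
        # Propagate through the chain to obtain the next predictive distribution.
        alpha = [sum(filt[s] * P[s][s2] for s in range(num_states))
                 for s2 in range(num_states)]
        state = draw(P[state], rng)
    return -log_prob / n
```

As a sanity check, when the hidden chain is i.i.d. uniform on two states and the emission is noiseless, every output path of length n has probability $2^{-n}$, so the sketch returns $\log 2$ up to round-off for any path.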

Recently, we have obtained [TH] a number of limit theorems for the sample entropy of Y. These
limit theorems can be viewed as further refinements of the Shannon-McMillan-Breiman theorem,
which is the backbone of information theory. More specifically, Theorem 1.2 in [19] is a CLT
with an error estimate, which can be used to characterize the rate of convergence of the Monte
Carlo methods in [U H2J ST], and Theorem 1.5 in [19] is a large deviation result, which gives a
sub-exponentially decaying upper bound on the probability of the sample entropy
$-\log p(Y_1^n)/n$ deviating from H(Y). Among many other applications, such as deriving
non-asymptotic coding theorems [SI], these theorems positively confirm the effectiveness of
using the Shannon-McMillan-Breiman theorem to approximate H(Y).

Related work on (II). The behavior of H(Y) (as a function of the underlying Markov chain and
the channel) is of significance in a number of scientific disciplines; in information theory
in particular, it is of great importance for computing or estimating the capacity of
finite-state channels. However, some basic properties, such as smoothness (or even
differentiability) of H(Y), have long remained unknown. Recently, the asymptotic behavior of
H(Y) has been studied in [3l [291 E9l SQl [56l EZl [381 SH S3]. In particular, in [S6], for a
special type of hidden Markov chain Y, the Taylor series expansion of H(Y) is given under the
assumption that H(Y) is analytic. Under mild assumptions, analyticity of H(Y) has been
established in [20]; see also related work in [T3J [SH EZJ [TJ ESI S3] and references therein.
The framework in [20] has been generalized to continuous-state settings and further provides
useful tools and techniques for our subsequent work on derivatives [21], asymptotics [22] and
concavity [23] of H(Y).

Equipped with ideas and techniques from the above-mentioned work on (I) and (II), we are
better prepared to make further progress towards the computation of the channel capacity. In
particular, the ideas and techniques in [19] and [20] are vital to this paper. Roughly
speaking, [20] proves that the entropy rate of hidden Markov chains is a "nicely behaved"
function, and [19] confirms that it can be "well-approximated" using Monte Carlo simulations.
The simulator of the derivative of I(X; Y) as specified in Section 4, which is crucial to this
work, is an "offspring" of the two schools of thought in [20] and [19].

Stochastic approximation methods refer to a family of recursive stochastic algorithms that aim
to find zeroes or extrema of functions whose values can only be estimated via noisy
observations. The extensive literature on stochastic approximation has grown up around two
prototypical algorithms, the Robbins-Monro algorithm and the Kiefer-Wolfowitz algorithm, and
mainly concerns the convergence analysis of these two algorithms and their variants; we refer
the reader to [31] for an exposition of the vast literature on stochastic approximation.

Inspired by the ideas in stochastic approximation, we propose a randomized algorithm to
compute the capacity of a class of finite-state channels with input Markov processes supported
on some mixing finite-type constraint. While bearing the same spirit as the Robbins-Monro
algorithm and the Kiefer-Wolfowitz algorithm, the proposed algorithm differs from both of them
in many subtle respects. The main task of this paper is to conduct a convergence analysis of
the proposed algorithm, which employs some established ideas and techniques from the field of
stochastic approximation. In particular, the proofs in Section 8 are largely inspired by [49],
which has credited the origins of some of its techniques to earlier work, such as [311 [33].
However, neither the results nor the proofs in [19] or any previous work imply our results; as
a matter of fact, a considerable amount of simplification and adaptation of the techniques in
[19] has been incorporated into this work.

Although described in different language, our settings are essentially the same as in [53].
On the other hand, as opposed to the GBAA, the concavity of I(X; Y) alone is already
sufficient to guarantee the convergence of our algorithm. Here, let us note that for certain
classes of channels (see Example 9.4), I(X; Y) is indeed concave with respect to a certain
parameterization, whereas H(X|Y) fails to be concave with respect to the same
parameterization.

Characterizing the maximal rate at which information can be transmitted through a given
channel, the capacity is the most fundamental notion in information theory. The
capacity-achieving distribution further provides insightful guidance towards designing coding
schemes that actually achieve the promised capacity. An algorithm that computes both would
therefore be of fundamental significance to information-theoretic research and to practical
applications in telecommunications and data storage.

The organization of the paper is as follows. We first describe our channel model in greater
detail in Section 2, and we then present our algorithm in Section 3. In Section 4, we propose
a simulator for the derivative of I(X; Y) and discuss its convergence behavior. The
convergence of the algorithm is established in Section 5, while the rates of convergence of
the algorithm with and without concavity conditions are derived in Sections 7 and 8,
respectively. In Section 9, we discuss the capacity-achieving distribution of a special class
of finite-state channels.

2 Channel Model 

In this section, we specify the channel model considered in this paper in greater detail; it
is essentially the same as the one considered in [53].
Let $\mathcal{X}$ be a finite alphabet and let

$$\mathcal{X}^2 = \{(i,j) : i, j \in \mathcal{X}\}.$$

Let $\Pi$ denote the set of all stationary irreducible first-order Markov chains over the
alphabet $\mathcal{X}$. For a given subset $F \subset \mathcal{X}^2$, define

$$\Pi_F = \{X \in \Pi : X_{ij} = 0, \ (i,j) \in F\},$$

where we have identified an irreducible first-order Markov chain with its transition
probability matrix. Furthermore, for any $\varepsilon > 0$, define

$$\Pi_{F,\varepsilon} = \{X \in \Pi_F : X_{ij} > \varepsilon, \ (i,j) \notin F\}.$$

Obviously, if some $X \in \Pi_{F,\varepsilon}$ is primitive (namely, irreducible and
aperiodic), then any $X' \in \Pi_{F,\varepsilon}$ is primitive; in this case, we say F is a
mixing finite-type constraint. Here, let us note that a mixing finite-type constraint can be
defined in a much more general context; see [3lj.
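The two defining properties above can be checked mechanically for a given transition matrix. The following is a minimal Python sketch under stated assumptions: the helper names `in_pi_F_eps` and `is_primitive` are hypothetical, states are indexed from 0, and primitivity is tested through Wielandt's bound (a nonnegative n-by-n matrix is primitive exactly when its $((n-1)^2+1)$-th power is entrywise positive), which is a standard fact rather than something taken from this paper.

```python
def in_pi_F_eps(P, F, eps, tol=1e-12):
    """Check that P is a stochastic matrix with P[i][j] == 0 on F and P[i][j] > eps off F."""
    n = len(P)
    for i in range(n):
        if abs(sum(P[i]) - 1.0) > tol:
            return False
        for j in range(n):
            if (i, j) in F:
                if P[i][j] != 0:
                    return False
            elif P[i][j] <= eps:
                return False
    return True


def is_primitive(P):
    """Primitivity test via Wielandt's bound on the 0/1 support pattern of P."""
    n = len(P)
    A = [[1 if P[i][j] > 0 else 0 for j in range(n)] for i in range(n)]
    M = [row[:] for row in A]
    for _ in range((n - 1) ** 2):  # M ends up as the support of A^((n-1)^2 + 1)
        M = [[1 if any(M[i][k] and A[k][j] for k in range(n)) else 0
              for j in range(n)] for i in range(n)]
    return all(all(row) for row in M)
```

For instance, with the alphabet {0, 1} and F = {(1, 1)} (no two consecutive 1's), the matrix [[0.6, 0.4], [1.0, 0.0]] passes both checks for eps = 0.1, whereas the permutation matrix [[0, 1], [1, 0]] is irreducible but periodic and hence not primitive.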

The motivation for considering finite-type constraints mainly comes from magnetic recording,
where input sequences are required to satisfy certain mixing finite-type constraints in order
to eliminate the most damaging error events [36]. The best-known example is the so-called
(d, k)-RLL constraint S(d, k) over the alphabet {0, 1}, which forbids any sequence with fewer
than d or more than k consecutive zeros in between two successive 1's.
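As a small illustration of the (d, k)-RLL constraint as worded above, the following Python sketch checks a finite binary sequence. The function name `satisfies_rll` is hypothetical, and the sketch deliberately constrains only runs of zeros that lie between two successive 1's, exactly as in the sentence above; leading and trailing runs are left unconstrained, which is an assumption of this sketch.

```python
def satisfies_rll(bits, d, k):
    """Return True if every run of 0s between two successive 1s has length in [d, k]."""
    ones = [pos for pos, b in enumerate(bits) if b == 1]
    for left, right in zip(ones, ones[1:]):
        run = right - left - 1  # number of zeros between the two 1s
        if run < d or run > k:
            return False
    return True
```

For example, under the (1, 2)-RLL constraint the sequence 100101 is allowed, while 11 (a run of length 0) and 10001 (a run of length 3) are forbidden.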

In this paper, we are concerned with a discrete-time finite-state channel with some input
constraint. Let X, Y, S denote the channel input, output and state processes over finite
alphabets $\mathcal{X}$, $\mathcal{Y}$ and $\mathcal{S}$, respectively. Assume that

(2.a) For some mixing finite-type constraint $F \subset \mathcal{X}^2$ and some
$\varepsilon > 0$, $X \in \Pi_{F,\varepsilon}$.

(2.b) (X, S) is a first-order stationary Markov chain whose transition probabilities satisfy

$$p(x_n, s_n | x_{n-1}, s_{n-1}) = p(x_n | x_{n-1}) \, p(s_n | x_n, s_{n-1}),$$

where $p(s_n | x_n, s_{n-1}) > 0$ for any $s_{n-1}, s_n, x_n$.

(2.c) The channel is stationary, and the channel transition probabilities satisfy

$$p(y_n, s_n | x_n, s_{n-1}) = p(s_n | x_n, s_{n-1}) \, p(y_n | x_n, s_n).$$

The capacity of the above channel is defined as

$$C_F = \sup I(X; Y) = \sup \lim_{n \to \infty} I_n(X; Y),$$

where the supremum is over all X satisfying (2.a) and

$$I_n(X; Y) = \frac{H(X_1^n) + H(Y_1^n) - H(X_1^n, Y_1^n)}{n}.$$

The fact that Y and (X, Y) are both hidden Markov processes makes it apparent that
solutions to (I) and (II) are essential for computing $C_F$.

Assume that $\Pi_{F,\varepsilon}$ is analytically parameterized by
$\theta \in \Theta \subset \mathbb{R}^d$, $d \ge 1$, where $\Theta$ denotes the entire
parameter space. Then, naturally, $X = X(\theta)$ and $Y = Y(\theta)$ are also analytically
parameterized by $\theta$. Under this parameterization, we would like to find
$\theta^* \in \Theta$ such that $X(\theta^*)$ maximizes $I(X(\theta); Y(\theta))$.

Remark 2.1. One natural goal is to find $X \in \Pi_F$ to maximize I(X; Y). However, in this
paper, we will restrict our attention to $\Pi_{F,\varepsilon}$ for a given
$\varepsilon > 0$; such a restriction will be justified in Section 9.

3 The Algorithm 

For a given $1/2 < a < 1$, choose the so-called step sizes

$$a_n = \frac{1}{n^a}, \quad n = 1, 2, \ldots;$$

apparently, $\{a_n\}$ satisfies

$$\sum_{n=0}^{\infty} a_n = \infty, \qquad \sum_{n=0}^{\infty} a_n^2 < \infty,$$

which are the typical conditions imposed on step sizes in a generic stochastic approximation
method. Letting $A_n$ denote the event "$\theta_n + a_n g_{n^b}(\theta_n) \notin \Theta$", we
propose to find $\theta^*$ through the following recursive procedure:

$$\theta_{n+1} = \begin{cases} \theta_n, & \text{if } A_n \text{ occurs}, \\ \theta_n + a_n g_{n^b}(\theta_n), & \text{otherwise}; \end{cases} \qquad (1)$$

here $b > 0$, the initial $\theta_0$ is randomly selected from $\Theta$, and
$g_{n^b}(\theta)$ is a to-be-specified simulator (see Section 4) for
$I'(X(\theta); Y(\theta))$, where the derivative is taken with respect to $\theta$.
Throughout the paper, we assume that

$$0 < \beta < \alpha < 1/3, \qquad 2a + b - 3b\beta > 1; \qquad (2)$$

here, $\alpha, \beta$ are some "hidden" parameters involved in the definition of
$g_{n^b}(\theta)$, which will be defined in Section 4.
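The recursive procedure of this section can be sketched in a few lines of Python. The sketch below is only an illustration under stated assumptions: the parameter space is taken to be an interval [lo, hi], and `toy_gradient` is a hypothetical stand-in for the simulator (the exact derivative of a toy concave function plus noise that shrinks with the simulation length), not the simulator of Section 4. The names `run_iteration` and `toy_gradient` are likewise invented for this example.

```python
import random


def run_iteration(noisy_grad, theta0, lo, hi, a=0.75, b=1.0, steps=20000, seed=0):
    """Iterate theta_{n+1} = theta_n + a_n * g(theta_n), staying put whenever
    the proposed point would leave the parameter interval [lo, hi]."""
    rng = random.Random(seed)
    theta = theta0
    for n in range(1, steps + 1):
        a_n = n ** (-a)                 # step size a_n = 1/n^a with 1/2 < a < 1
        sim_len = max(1, int(n ** b))   # the simulator is run at length n^b
        proposal = theta + a_n * noisy_grad(theta, sim_len, rng)
        if lo <= proposal <= hi:        # the event A_n did not occur
            theta = proposal
    return theta


def toy_gradient(theta, sim_len, rng):
    """Hypothetical stand-in for the simulator: derivative of -(theta - 0.3)^2
    plus Gaussian noise whose standard deviation shrinks with the simulation length."""
    return -2.0 * (theta - 0.3) + rng.gauss(0.0, 1.0) / sim_len ** 0.25
```

Calling `run_iteration(toy_gradient, 0.9, 0.0, 1.0)` drives the iterate towards the maximizer 0.3 of the toy function; the "stay put" branch mirrors the rule that the iterate is left unchanged whenever the proposed update would leave the parameter space.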

4 A Simulator of I'(X; Y)

As stated in Section 1, although rather difficult to compute analytically, $I_n(X; Y)$ can be
well-approximated via Monte Carlo simulations. In this section, we propose a simulator for
I'(X; Y). Needless to say, an effective simulator guaranteeing an "accurate" approximation
to I'(X; Y) is crucial to our algorithm. To some extent, our simulator is inspired by
Bernstein's blocking method jH], which is a well-established tool in proving limit theorems
for mixing sequences; see, e.g., [H].

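Now, consider a stationary stochastic process $Z = Z_{-\infty}^{\infty}$ satisfying the
following assumptions:

(4.a) There exist $C, C' > 0$ such that for all $z_{-n}^{0}$,

$$C \le p(z_0 | z_{-n}^{-1}) \le C'.$$

(4.b) There exist $C > 0$, $0 < \lambda < 1$ such that for all n,

$$\psi_Z(n) = \sup_{U \in \mathcal{B}(Z_{-\infty}^{0}),\, V \in \mathcal{B}(Z_n^{\infty}),\, P(U) > 0,\, P(V) > 0} |P(V|U) - P(V)| / P(V) \le C \lambda^n,$$

where $\mathcal{B}(Z_i^j)$ denotes the $\sigma$-field generated by
$\{Z_k : k = i, i+1, \ldots, j\}$.

(4.c) There exist $C > 0$, $0 < \rho < 1$ such that for any two $z_{-m}^{0}$,
$\hat{z}_{-\hat{m}}^{0}$ with $z_{-n}^{0} = \hat{z}_{-n}^{0}$ (here
$m, \hat{m} \ge n \ge 0$),

$$|p(z_0 | z_{-m}^{-1}) - p(\hat{z}_0 | \hat{z}_{-\hat{m}}^{-1})| \le C \rho^n.$$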

Remark 4.1. Conditions (4.a)-(4.c) are the same as those used in Section 2 of [19], where
they are essential for establishing the main results. As observed in [19], Condition (2.a)
implies that Y and (X, Y) both satisfy Conditions (4.a)-(4.c).

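Now, for $0 < \beta < \alpha < 1/3$, define

$$q = q(n) = n^{\beta}, \qquad p = p(n) = n^{\alpha}, \qquad k = k(n) = n/(n^{\alpha} + n^{\beta}).$$

For any j with $iq + (i-1)p + 1 \le j \le iq + ip$, define

$$W_j = W_j(z_{j-\lfloor q/2 \rfloor}^{j}) = -\left( \frac{p'(z_{j-\lfloor q/2 \rfloor})}{p(z_{j-\lfloor q/2 \rfloor})} + \frac{p'(z_{j-\lfloor q/2 \rfloor+1} | z_{j-\lfloor q/2 \rfloor})}{p(z_{j-\lfloor q/2 \rfloor+1} | z_{j-\lfloor q/2 \rfloor})} + \cdots + \frac{p'(z_j | z_{j-\lfloor q/2 \rfloor}^{j-1})}{p(z_j | z_{j-\lfloor q/2 \rfloor}^{j-1})} \right) \log p(z_j | z_{j-\lfloor q/2 \rfloor}^{j-1}),$$

and furthermore

$$\zeta_i = W_{iq+(i-1)p+1} + \cdots + W_{iq+ip}, \qquad S_n = \sum_{i=1}^{k(n)} \zeta_i.$$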
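The index bookkeeping of the blocking scheme above (long blocks of length p separated by gaps of length q) is easy to get wrong, so here is a minimal Python sketch of it. The function name `blocking_scheme` is hypothetical, and rounding $n^{\alpha}$, $n^{\beta}$ and k down to integers is an assumption of this sketch; the text treats them as real-valued orders of magnitude.

```python
def blocking_scheme(n, alpha, beta):
    """Return (q, p, k, blocks) where blocks[i-1] is the index range
    [i*q + (i-1)*p + 1, i*q + i*p] of the i-th long block, i = 1, ..., k."""
    if not 0 < beta < alpha < 1 / 3:
        raise ValueError("need 0 < beta < alpha < 1/3")
    q = max(1, int(n ** beta))    # gap length, rounded down (a choice made for this sketch)
    p = max(1, int(n ** alpha))   # long-block length, rounded down
    k = n // (p + q)              # number of (gap, block) pairs that fit into 1..n
    blocks = [(i * q + (i - 1) * p + 1, i * q + i * p) for i in range(1, k + 1)]
    return q, p, k, blocks
```

With n = 10000, alpha = 0.3 and beta = 0.2 this gives q = 6, p = 15 and k = 476; each block contributes p summands to S_n, and consecutive blocks are separated by exactly q skipped indices, which is what makes the block sums nearly independent under the mixing condition.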

Now, we are ready to define our simulator for I'(X; Y).
Definition 4.2.

$$g_n = g_n(X_1^n, Y_1^n) \triangleq H'(X_2 | X_1) + S_n(Y_1^n)/(kp) - S_n(X_1^n, Y_1^n)/(kp).$$

The following lemma, whose proof is somewhat similar to that of Lemma 3.3 in [19], gives
an estimate of the variance of $S_n$.

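Lemma 4.3. For Z satisfying Conditions (4.a), (4.b) and (4.c),

$$E[(S_n - E[S_n])^2] = O(k p q^3).$$

Proof. As in [19], using Conditions (4.a) and (4.b), we can deduce that for some
$0 < \lambda < 1$,

$$E[(S_n - E[S_n])^2] = E\Big[\Big(\sum_{i=1}^{k} (\zeta_i - E[\zeta_i])\Big)^2\Big] = \sum_{i=1}^{k} E[(\zeta_i - E[\zeta_i])^2] + O(k^2 \lambda^{q/2}).$$

So, to prove the lemma, it suffices to prove that for any $i \in \mathbb{N}$,

$$E[(\zeta_i - E[\zeta_i])^2] = O(p q^3).$$

Note that

$$E[(\zeta_i - E[\zeta_i])^2] = E\Big[\Big(\sum_{j} (W_j - E[W_j])\Big)^2\Big] = \sum_{j, l} E[(W_j - E[W_j])(W_l - E[W_l])], \qquad (3)$$

where j and l range over the p indices of the i-th block. It is apparent that when
$|j - l| \le \lfloor q/2 \rfloor$,

$$E[(W_j - E[W_j])(W_l - E[W_l])] = O(q^2), \qquad (4)$$

and one verifies, using Conditions (4.a) and (4.b), that when $|j - l| > \lfloor q/2 \rfloor$,

$$E[(W_j - E[W_j])(W_l - E[W_l])] = O(q^2 \lambda^{|j - l| - \lfloor q/2 \rfloor}). \qquad (5)$$

Combining (3), (4) and (5), we then have

$$E[(\zeta_i - E[\zeta_i])^2] = \Big( \sum_{|j - l| \le \lfloor q/2 \rfloor} + \sum_{|j - l| > \lfloor q/2 \rfloor} \Big) E[(W_j - E[W_j])(W_l - E[W_l])] = O(p q^3).$$

The proof is then complete.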
The following three theorems characterize the performance of our simulator from different
perspectives.

Using techniques similar to those in the proof of Theorem 1.1 in [20], the first theorem
shows that, on average, our simulator converges sub-exponentially to I'(X; Y).

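Theorem 4.4. For some $0 < \rho < 1$, we have

$$E[g_n(X_1^n, Y_1^n)] - I'(X; Y) = O(\rho^{\lfloor q/2 \rfloor}).$$

Proof. Notice that for the Markov chain X, we have

$$H(X) = H(X_2 | X_1).$$

So, by Remark 4.1, it suffices to prove that for any Z satisfying Conditions (4.a)-(4.c), we
have

$$\frac{E[S_n]}{kp} - H'(Z) = O(\rho_1^{\lfloor q/2 \rfloor}),$$

for some $0 < \rho_1 < 1$.

Note that for any j with $iq + (i-1)p + 1 \le j \le iq + ip$, we have

$$E[W_j] = -\sum_{z_{j-\lfloor q/2 \rfloor}^{j}} p(z_{j-\lfloor q/2 \rfloor}^{j}) \left( \frac{p'(z_{j-\lfloor q/2 \rfloor})}{p(z_{j-\lfloor q/2 \rfloor})} + \frac{p'(z_{j-\lfloor q/2 \rfloor+1} | z_{j-\lfloor q/2 \rfloor})}{p(z_{j-\lfloor q/2 \rfloor+1} | z_{j-\lfloor q/2 \rfloor})} + \cdots + \frac{p'(z_j | z_{j-\lfloor q/2 \rfloor}^{j-1})}{p(z_j | z_{j-\lfloor q/2 \rfloor}^{j-1})} \right) \log p(z_j | z_{j-\lfloor q/2 \rfloor}^{j-1})$$

$$= -\sum_{z_{j-\lfloor q/2 \rfloor}^{j}} p'(z_{j-\lfloor q/2 \rfloor}^{j}) \log p(z_j | z_{j-\lfloor q/2 \rfloor}^{j-1}).$$

Then, following [20], we can prove that for any small $\varepsilon > 0$, we have

$$\sum_{z_1^n} |p'(z_n | z_1^{n-1})| = O((1 + \varepsilon)^n).$$

This, together with Condition (4.c), implies that for some $0 < \rho_1 < 1$,

$$E[W_j] - H'(Z) = O(\rho_1^{\lfloor q/2 \rfloor}),$$

which further implies that for some $0 < \rho_1 < 1$,

$$\frac{E[S_n]}{kp} - H'(Z) = \frac{E[S_n] - kp H'(Z)}{kp} = \frac{\sum_{j} (E[W_j] - H'(Z))}{kp} = O(\rho_1^{\lfloor q/2 \rfloor}),$$

where the sum runs over all kp indices j appearing in $S_n$.

□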



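The following large-deviation-type theorem gives a sub-exponentially decaying upper bound
on the tail probability of $g_n(X_1^n, Y_1^n)$ deviating from I'(X; Y).

Theorem 4.5. For any $\varepsilon > 0$, there exist $0 < \gamma, \delta < 1$ such that

$$P(|g_n(X_1^n, Y_1^n) - I'(X; Y)| > \varepsilon) < \gamma^{n^{\delta}}.$$

Proof. By Theorem 4.4 and Remark 4.1, it suffices to prove that for any Z satisfying
Conditions (4.a)-(4.c) and for any $\varepsilon > 0$, there exist $0 < \gamma, \delta < 1$
such that

$$P\left( \left| \frac{S_n - E[S_n]}{kp} \right| > \varepsilon \right) < \gamma^{n^{\delta}}. \qquad (6)$$

By the Markov inequality, we have, for any $t > 0$,

$$P\left( \frac{S_n - E[S_n]}{kp} > \varepsilon \right) \le \frac{E[e^{t(S_n - E[S_n])/p^2}]}{e^{tk\varepsilon/p}}. \qquad (7)$$

As in [19], applying Conditions (4.a) and (4.b), we then have

$$E[e^{t(S_n - E[S_n])/p^2}] = E[e^{t \sum_{i=1}^{k-1} (\zeta_i - E[\zeta_i])/p^2} e^{t(\zeta_k - E[\zeta_k])/p^2}] = (1 + O(\lambda^{q(n)/2})) \, E[e^{t \sum_{i=1}^{k-1} (\zeta_i - E[\zeta_i])/p^2}] \, E[e^{t(\zeta_k - E[\zeta_k])/p^2}], \qquad (8)$$

for some $0 < \lambda < 1$. An iterative application of (8) yields that for any $0 < t < 1$,

$$E[e^{t(S_n - E[S_n])/p^2}] = E[e^{t \sum_{i=1}^{k} (\zeta_i - E[\zeta_i])/p^2}] = (1 + O(\lambda^{q(n)/2}))^{k-1} (E[e^{t(\zeta_1 - E[\zeta_1])/p^2}])^{k},$$

as n goes to infinity. By Condition (4.a), we have

$$\zeta_1 - E[\zeta_1] = O(pq), \text{ and thus } O((\zeta_1 - E[\zeta_1])^2 / p^4) = O(q^2/p^2) = o(1).$$

It then follows that for any $0 < t < 1$,

$$E[e^{t(\zeta_1 - E[\zeta_1])/p^2}] = 1 + O(1) t^2. \qquad (9)$$

Choosing $t = n^{-(1-\alpha)/2}$, then, by (7), (8) and (9), we deduce that for some constant
$c > 0$,

$$P\left( \frac{S_n - E[S_n]}{kp} > \varepsilon \right) \le \frac{E[e^{t(S_n - E[S_n])/p^2}]}{e^{tk\varepsilon/p}} \le (1 + O(\lambda^{q(n)/2}))^{k-1} \left( \frac{1 + O(1) t^2}{e^{t\varepsilon/p}} \right)^{k} = O(e^{-c \varepsilon n^{1/2 - 3\alpha/2}}).$$

Noticing that $0 < \alpha < 1/3$ (and thus $1/2 - 3\alpha/2 > 0$), we conclude that for any
$\varepsilon > 0$, there exist $0 < \gamma, \delta < 1$ such that

$$P\left( \frac{S_n - E[S_n]}{kp} > \varepsilon \right) < \gamma^{n^{\delta}}.$$

With a parallel argument, one verifies that for any $\varepsilon > 0$, there exist
$0 < \gamma, \delta < 1$ such that

$$P\left( \frac{S_n - E[S_n]}{kp} < -\varepsilon \right) < \gamma^{n^{\delta}},$$

which immediately implies (6). The proof is then complete.

□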



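The following theorem states that our simulator is strongly consistent, that is, it converges
almost surely.
Theorem 4.6. With probability 1,

$$g_n(X_1^n, Y_1^n) \to I'(X; Y),$$

as n tends to $\infty$.

Proof. By Theorem 4.5, for any $\varepsilon > 0$ the probabilities
$P(|g_n(X_1^n, Y_1^n) - I'(X; Y)| > \varepsilon)$ are summable in n, so the claim follows from
the Borel-Cantelli lemma. □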

Remark 4.7. In our notation, the following expression has been proposed in [53] as a
simulator of I'(X; Y):

$$H'(X_2 | X_1) - \frac{p'(Y_1^n)}{p(Y_1^n)} \log p(Y_1^n)/n + \frac{p'(X_1^n, Y_1^n)}{p(X_1^n, Y_1^n)} \log p(X_1^n, Y_1^n)/n.$$

Extensive numerical experiments conducted in [53] suggest that this simulator converges to
I'(X; Y) almost surely as n tends to infinity; however, there is no rigorous proof of the
convergence.



5 Convergence 

Consider the following condition:

(5.a) $P(\bigcap_{n=1}^{\infty} \bigcup_{k=n}^{\infty} A_k) = 0$, that is, almost surely,
$A_n$, $n \in \mathbb{N}$, occurs only finitely many times,

which will be assumed throughout the convergence analysis in the paper. In particular, in
this section, assuming (5.a), we will show that $\{I(X(\theta_n); Y(\theta_n))\}$ converges
almost surely. Note that if $\Theta = \mathbb{R}^d$, then Assumption (5.a) will be trivially
satisfied and the iteration in (1) can be simply written as

$$\theta_{n+1} = \theta_n + a_n g_{n^b}(\theta_n). \qquad (10)$$

In fact, unless specified otherwise, we will simply assume that $\Theta = \mathbb{R}$ in all
the proofs in this paper to avoid obscuring the main idea. The proofs of the same results
under Assumption (5.a) follow from parallel arguments, only with an increased level of
notational complexity. Henceforth, we will write

$$f(\theta) = I(X(\theta); Y(\theta)), \qquad f_n(\theta) = I_n(X(\theta); Y(\theta)).$$

Note that under Assumption (2.a), Theorem 1.1 of [20] implies that

$f(\theta)$ is analytic and each of its derivatives is uniformly bounded over all
$\theta \in \Theta$,

a key fact that will be implicitly used throughout the paper. Now, rewrite (10) as

$$\theta_{n+1} = \theta_n + a_n f'(\theta_n) + a_n R_n(\theta_n), \qquad (11)$$

where

$$R_n(\theta_n) = g_{n^b}(\theta_n) - f'(\theta_n).$$
It can be easily verified that

$$f(\theta_{n+1}) - f(\theta_n) = \int_0^1 f'(\theta_n + t(\theta_{n+1} - \theta_n))(\theta_{n+1} - \theta_n)\,dt$$

$$= \int_0^1 f'(\theta_n)(\theta_{n+1} - \theta_n)\,dt + \int_0^1 (f'(\theta_n + t(\theta_{n+1} - \theta_n)) - f'(\theta_n))(\theta_{n+1} - \theta_n)\,dt$$

$$= a_n f'(\theta_n)(f'(\theta_n) + R_n(\theta_n)) + \int_0^1 (f'(\theta_n + t(\theta_{n+1} - \theta_n)) - f'(\theta_n))(\theta_{n+1} - \theta_n)\,dt$$

$$= a_n f'^2(\theta_n) + \hat{R}_n(\theta_n), \qquad (12)$$

where

$$\hat{R}_n(\theta_n) \triangleq a_n f'(\theta_n) R_n(\theta_n) + \int_0^1 (f'(\theta_n + t(\theta_{n+1} - \theta_n)) - f'(\theta_n))(\theta_{n+1} - \theta_n)\,dt.$$



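Lemma 5.1. $\sum_{n=0}^{\infty} \hat{R}_n(\theta_n)$ converges almost surely, where
$\hat{R}_n(\theta_n)$ is the remainder term in (12).
Proof. Let

$$T_1 = \sum_{n=0}^{\infty} a_n f'(\theta_n) R_n(\theta_n), \qquad T_2 = \sum_{n=0}^{\infty} \int_0^1 (f'(\theta_n + t(\theta_{n+1} - \theta_n)) - f'(\theta_n))(\theta_{n+1} - \theta_n)\,dt.$$

It suffices to prove that $T_1, T_2$ both converge almost surely.
For $T_1$, note that

$$T_1 = \sum_{n=0}^{\infty} a_n f'(\theta_n)(g_{n^b}(\theta_n) - f'(\theta_n)) = \sum_{n=0}^{\infty} a_n f'(\theta_n)(g_{n^b}(\theta_n) - f'_{n^b}(\theta_n)) + \sum_{n=0}^{\infty} a_n f'(\theta_n)(f'_{n^b}(\theta_n) - f'(\theta_n)).$$

It follows from Theorem 4.4 that there exists $0 < \rho_0 < 1$ such that

$$\sum_{n=0}^{\infty} |a_n f'(\theta_n)(f'_{n^b}(\theta_n) - f'(\theta_n))| \le \sum_{n=0}^{\infty} O(a_n |f'(\theta_n)| \rho_0^{\lfloor q(n^b)/2 \rfloor}) < \infty. \qquad (13)$$

Then, using Lemma 4.3, one verifies that uniformly over all $\theta_n \in \Theta$,

$$\sum_{n=0}^{\infty} E[a_n^2 (f'(\theta_n))^2 R_n^2(\theta_n)] = \sum_{n=0}^{\infty} O\left( \frac{1}{n^{2a + b - 3b\beta}} \right), \qquad (14)$$

which converges since $2a + b - 3b\beta > 1$. Noting that
$\{a_n f'(\theta_n) R_n(\theta_n), \mathcal{B}(X_1^n)\}$ is a martingale difference sequence
and applying Doob's martingale convergence theorem (see Theorem 2.8.7 of HH1), we deduce that

$$\sum_{n=0}^{\infty} a_n f'(\theta_n)(g_{n^b}(\theta_n) - f'_{n^b}(\theta_n))$$

converges with probability 1. The almost sure convergence of $T_1$ then follows.
For $T_2$, it is easy to check that

$$\int_0^1 (f'(\theta_n + t(\theta_{n+1} - \theta_n)) - f'(\theta_n))(\theta_{n+1} - \theta_n)\,dt = O((\theta_{n+1} - \theta_n)^2) = O(a_n^2 (f'(\theta_n))^2) + O(a_n^2 R_n^2(\theta_n)).$$

Similarly as in deriving (13) and (14), we have

$$\sum_{n=0}^{\infty} a_n^2 (f'_{n^b}(\theta_n) - f'(\theta_n))^2 < \infty, \qquad \sum_{n=0}^{\infty} E[a_n^2 (g_{n^b}(\theta_n) - f'_{n^b}(\theta_n))^2] < \infty,$$

and furthermore,

$$\sum_{n=0}^{\infty} a_n^2 (g_{n^b}(\theta_n) - f'_{n^b}(\theta_n))^2$$

converges almost surely. This, together with (13), further implies that

$$\sum_{n=0}^{\infty} a_n^2 |(g_{n^b}(\theta_n) - f'_{n^b}(\theta_n))(f'_{n^b}(\theta_n) - f'(\theta_n))|$$

converges almost surely. Recalling that

$$R_n(\theta_n) = g_{n^b}(\theta_n) - f'_{n^b}(\theta_n) + f'_{n^b}(\theta_n) - f'(\theta_n),$$

we conclude that

$$\sum_{n=0}^{\infty} a_n^2 R_n^2(\theta_n) < \infty,$$

which further implies that

$$\sum_{n=0}^{\infty} \int_0^1 (f'(\theta_n + t(\theta_{n+1} - \theta_n)) - f'(\theta_n))(\theta_{n+1} - \theta_n)\,dt$$

converges almost surely. The proof is then complete. □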

We are now ready for the following convergence theorem, whose proof closely follows that
of Lemma 7 in [19], which can be further traced back to the standard proof of the martingale
convergence theorem.



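Theorem 5.2. With probability 1, we have

$$\lim_{n \to \infty} f'(\theta_n) = 0 \quad \text{and} \quad \lim_{n \to \infty} f(\theta_n) \text{ exists}.$$

Proof. Recall that

$$f(\theta_{n+1}) - f(\theta_n) = a_n f'^2(\theta_n) + \hat{R}_n(\theta_n),$$

an iterative application of which implies

$$f(\theta_n) - f(\theta_0) = \sum_{i=0}^{n-1} a_i f'^2(\theta_i) + \sum_{i=0}^{n-1} \hat{R}_i(\theta_i).$$

Applying Lemma 5.1, we deduce that with probability 1,

$$\sum_{i=0}^{\infty} a_i (f'(\theta_i))^2 < \infty,$$

which, in turn, implies that $\lim_{n \to \infty} f(\theta_n)$ exists and furthermore there
is a subsequence $\{\theta_{n_j}\}$ such that $f'(\theta_{n_j})$ converges to 0 as j tends to
infinity.
We now prove that

$$\lim_{n \to \infty} f'(\theta_n) = 0.$$

By way of contradiction, suppose otherwise. Then, there exists $\varepsilon > 0$ such that
there exist infinite sequences $m_k, n_k$, $k = 1, 2, \ldots$, such that

$$|f'(\theta_{m_k})| < \varepsilon, \qquad |f'(\theta_{n_k})| > 2\varepsilon, \qquad |f'(\theta_i)| > \varepsilon \qquad (15)$$

for all $m_k + 1 < i < n_k$. It then follows that

$$\varepsilon < |f'(\theta_{n_k}) - f'(\theta_{m_k})| = O(|\theta_{n_k} - \theta_{m_k}|) = O\Big( \Big| \sum_{i=m_k}^{n_k - 1} a_i f'(\theta_i) \Big| \Big) + O\Big( \Big| \sum_{i=m_k}^{n_k - 1} a_i R_i(\theta_i) \Big| \Big) = O\Big( \sum_{i=m_k}^{n_k - 1} a_i \Big) + O\Big( \Big| \sum_{i=m_k}^{n_k - 1} a_i R_i(\theta_i) \Big| \Big). \qquad (16)$$

As in the proof of Lemma 5.1, using the decomposition

$$R_n(\theta_n) = g_{n^b}(\theta_n) - f'(\theta_n) = g_{n^b}(\theta_n) - f'_{n^b}(\theta_n) + f'_{n^b}(\theta_n) - f'(\theta_n),$$

we deduce that $\sum_{n=0}^{\infty} a_n R_n(\theta_n)$ converges almost surely, and hence
$|\sum_{i=m_k}^{n_k - 1} a_i R_i(\theta_i)|$ tends to 0 as k goes to $\infty$. On the other
hand, by (15), we have

$$\sum_{i=m_k}^{n_k - 1} a_i \le \frac{1}{\varepsilon^2} \sum_{i=m_k}^{\infty} a_i (f'(\theta_i))^2.$$

This implies that as k tends to $\infty$, $\sum_{i=m_k}^{n_k - 1} a_i$ tends to zero, which,
together with (16), further implies that

$$\varepsilon \le \lim_{k \to \infty} |f'(\theta_{n_k}) - f'(\theta_{m_k})| = 0,$$

a contradiction.

□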



Remark 5.3. The fact that $\{f(\theta_n)\}$ converges almost surely does not necessarily imply
that $\{\theta_n\}$ converges almost surely. In the remainder of this paper, we will prove,
under some assumptions, that $\{\theta_n\}$ does converge almost surely.
6 Some Estimations

In this section, assuming (5.a), we will derive some estimates that will be used in the later
sections for convergence analysis.
For any $j \in \mathbb{N}$, let

$$A_j = a_1 + a_2 + \cdots + a_{j-1},$$

and for any $h > 0$ and any $n \in \mathbb{N}$, define

$$t(n, h) = \min\{k : a_n + a_{n+1} + \cdots + a_{k-1} \ge h\}.$$

Now, for any fixed $n \in \mathbb{N}$, recursively define

$$n_0 = n, \qquad n_{k+1} = t(n_k, h).$$

One then verifies that for k sufficiently large,

$$A_{n_{k+1}} - A_{n_k} = \Theta(h), \qquad n_k = \Theta(k^{1/(1-a)}), \qquad (17)$$

where by $M = \Theta(N)$, we mean that there exist positive constants $C_1, C_2$ such that

$$C_1 N \le M \le C_2 N.$$
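Now, an iterated application of

$$\theta_{n+1} - \theta_n = a_n f'(\theta_n) + a_n R_n(\theta_n)$$

yields

$$\theta_k = \theta_n + \sum_{i=n}^{k-1} a_i f'(\theta_i) + \sum_{i=n}^{k-1} a_i R_i(\theta_i) = \theta_n + (A_k - A_n) f'(\theta_n) + \sum_{i=n}^{k-1} a_i R_i(\theta_i) + \sum_{i=n}^{k-1} a_i (f'(\theta_i) - f'(\theta_n)) = \theta_n + (A_k - A_n) f'(\theta_n) + R_{n,k},$$

where

$$R_{n,k} = \sum_{i=n}^{k-1} a_i R_i(\theta_i) + \sum_{i=n}^{k-1} a_i (f'(\theta_i) - f'(\theta_n)). \qquad (18)$$

Similarly, in the same way as (12) was derived, we have

$$f(\theta_k) - f(\theta_n) = \int_0^1 f'(\theta_n + t(\theta_k - \theta_n))(\theta_k - \theta_n)\,dt$$

$$= \int_0^1 f'(\theta_n)(\theta_k - \theta_n)\,dt + \int_0^1 (f'(\theta_n + t(\theta_k - \theta_n)) - f'(\theta_n))(\theta_k - \theta_n)\,dt$$

$$= f'(\theta_n)((A_k - A_n) f'(\theta_n) + R_{n,k}) + \int_0^1 (f'(\theta_n + t(\theta_k - \theta_n)) - f'(\theta_n))(\theta_k - \theta_n)\,dt$$

$$= (A_k - A_n) f'^2(\theta_n) + f'(\theta_n) R_{n,k} + \int_0^1 (f'(\theta_n + t(\theta_k - \theta_n)) - f'(\theta_n))(\theta_k - \theta_n)\,dt$$

$$= (A_k - A_n) f'^2(\theta_n) + \hat{R}_{n,k}(\theta_n), \qquad (19)$$

where

$$\hat{R}_{n,k}(\theta_n) = f'(\theta_n) R_{n,k} + \int_0^1 (f'(\theta_n + t(\theta_k - \theta_n)) - f'(\theta_n))(\theta_k - \theta_n)\,dt. \qquad (20)$$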

Jo 

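The following lemma introduces a positive random variable, $C_0$, and a constant, $\tau$,
which will be referred to throughout the rest of the paper.

Lemma 6.1. There exists a positive random variable $C_0$ such that for all n and for any
$\tau > 0$ with $2a + b - 3b\beta - 2\tau > 1$,

$$\sup_{k \ge n} \Big| \sum_{i=n}^{k} a_i R_i(\theta_i) \Big| \le C_0 n^{-\tau} \quad \text{a.s.}$$

Proof. For any $\tau > 0$ with $2a + b - 3b\beta - 2\tau > 1$, as in the proof of Lemma 5.1,
we deduce that $\sum_{i=1}^{\infty} i^{\tau} a_i R_i(\theta_i)$ converges almost surely.
Letting

$$T_n = \sum_{i=1}^{n} i^{\tau} a_i R_i(\theta_i),$$

we then have for any $k > n$, by summation by parts,

$$\sum_{i=n}^{k} a_i R_i(\theta_i) = \sum_{i=n}^{k} (i^{\tau} a_i R_i(\theta_i)) i^{-\tau} = \sum_{i=n}^{k} (T_i - T_{i-1}) i^{-\tau}$$

$$= \sum_{i=n}^{k} T_i i^{-\tau} - \sum_{i=n-1}^{k-1} T_i (i+1)^{-\tau} = T_k k^{-\tau} - T_{n-1} n^{-\tau} + \sum_{i=n}^{k-1} (i^{-\tau} - (i+1)^{-\tau}) T_i,$$

so that

$$\Big| \sum_{i=n}^{k} a_i R_i(\theta_i) \Big| \le \Big( k^{-\tau} + n^{-\tau} + \sum_{i=n}^{k-1} (i^{-\tau} - (i+1)^{-\tau}) \Big) \sup_i |T_i| = 2 n^{-\tau} \sup_i |T_i|,$$

which immediately implies the lemma. □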

In the following, to avoid cumbersome notation, we will use C to denote a positive constant,
which may differ from one appearance to the next.

Lemma 6.2. Let $0 < h < 1$ and let $C_0, \tau$ be as in Lemma 6.1. Then we have

(1) there exists a constant $C > 0$ such that

$$|f'(\theta_{t(n,h)})| \le C(C_0 n^{-\tau} + |f'(\theta_n)|).$$

(2) there exists a constant $C > 0$ such that

$$|\theta_{t(n,h)} - \theta_n| \le C(C_0 n^{-\tau} + h |f'(\theta_n)|).$$

(3) there exists a constant $C > 0$ such that

$$|R_{n,t(n,h)}| \le C(C_0 n^{-\tau} + h^2 |f'(\theta_n)|).$$

(4) there exists a constant $C > 0$ such that

$$|\hat{R}_{n,t(n,h)}| \le C(C_0^2 n^{-2\tau} + C_0 n^{-\tau} |f'(\theta_n)| + h^2 |f'(\theta_n)|^2).$$

(5) there exists a constant $C > 0$ such that

$$f(\theta_n) - f(\theta_{t(n,h)}) \le -(3/4 - 3Ch/2) h |f'(\theta_n)|^2 + C C_0^2 n^{-2\tau} (1 + 1/(2h^2)).$$

(6) there exists $C > 0$ such that for sufficiently small h,

$$2(f(\theta_n) - f(\theta_{t(n,h)})) + |f'(\theta_n)| |\theta_{t(n,h)} - \theta_n| \le (C + 1/(2h^2)) C_0^2 n^{-2\tau}.$$

(7) for any $\tau' < \tau$, there exists a positive constant C such that for sufficiently
small h, we have

$$|\theta_{t(n,h)} - \theta_n| \le C n^{\tau'} (f(\theta_{t(n,h)}) - f(\theta_n)) + C C_0^2 n^{-\tau'}.$$
Proof. In this proof, for notational simplicity, we will write t(n, h) as k. 
Note that there exists a positive constant C such that 

\f(o k )\<\f(e n )\ + \f(e k )-f(e n )\ 

< \ f\e n )\ + c\e k -e n \ 

k-l k-l 

< \f{0n)\ +Cj2ai\f'(8i)\ + C\J2 a ^ ' 



i \ w i 1 1 j 

i=n 



where we have applied (TIT]) . Applying Lemma [6. 1\ we then have 



k-l 



\f'(0k)\ < CC n- T + +Cj2^i\f ( 



Applying Gronwall's lemma, we then have for n sufficiently large 

\f'(0 k )\ < (CC Q n- T + \f\9 n )\)exp(C(a n + a n+1 + --- + a k ^)) <exp(C)(CC Q n- T + \f'(9 n )\), 
where we have used the fact that for n large enough 

a n + a n+1 H h a fc _i « h < 1. 

We have then established (1). 

It then follows from (1) that for some C 

k-l k-l 



i=n i=n 

< (A k - A n )(CC n- T + C|/'(0 n )|) + C Q n-\ 



which immediately implies (2). 

Now, by ffTgj) and (2), we have for some C 



k-l 



\R n ,k\ < C n T + C ai\$i — 6 n \ 

i=n 

< C n- T + C 2 (A k - A n )(C n^ + (A k - A n )\f'(9 n )\), 

which establishes (3). 
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Furthermore, by ( 120]) . (2) and (3), we have 

\Rn,k\ < \ f'{9n)\\Rn,k\ + C\9 k — 9 n \ 

< CC Q n-*\f'(e n )\ + C(A k - A n f\f{9 n )\ 2 + 2C\Cln^ + (A k - A n ) 2 \f'(9 n )\ 2 ), 

which establishes (4). 

It then follows from (|T9j) . (Ill) and (IV) and that for sufficiently large n 

f(0 n ) - f{9 k ) < -(A k - A n )\f'(9 n )\ 2 + \R n>k \ 

< -3h/4\f'(9 n )\ 2 + C(C 2 n- 2 ^ + C n^\f(9 n )\ + h 2 \f\9 n )\ 2 ) 

< -3h/A\f'(9 n )\ 2 + C(C 2 n- 2 ^ + C 2 n- 2 ^/(2h 2 ) + h 2 \f{9 n )\ 2 /2 + h 2 \f(9 n )\ 2 ) 

< -3h/4\f'(9 n )\ 2 + C(C 2 n- 2T (l + l/(2h 2 )) + 3h 2 /2\f {9 n )\ 2 ) 

< -(3/4 - 3Ch/2)h\f\9 n )\ 2 + CC 2 n- 2c (l + l/(2/i 2 )), 

which establishes (5). 

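It follows from (??), (3) and (4) that

\[
\begin{aligned}
f(\theta_k) - f(\theta_n) &= |f'(\theta_n)||(A_k - A_n)f'(\theta_n)| + \tilde R_{n,k} \\
&= |f'(\theta_n)||\theta_k - \theta_n + R_{n,k}| + \tilde R_{n,k} \\
&\ge |f'(\theta_n)|\big(|\theta_k - \theta_n| - |R_{n,k}|\big) - |\tilde R_{n,k}| \\
&\ge |f'(\theta_n)||\theta_k - \theta_n| - C\big(C_0^2 n^{-2\tau} + C_0 n^{-\tau}|f'(\theta_n)| + (A_k - A_n)^2|f'(\theta_n)|^2\big),
\end{aligned}
\]

which implies that

\[
\begin{aligned}
f(\theta_n) - f(\theta_k) + |f'(\theta_n)||\theta_k - \theta_n| &\le C\big(C_0^2 n^{-2\tau} + C_0 n^{-\tau}|f'(\theta_n)| + (A_k - A_n)^2|f'(\theta_n)|^2\big) \\
&\le C\big(C_0^2 n^{-2\tau}(1 + 1/(2h^2)) + \tfrac{3h^2}{2}|f'(\theta_n)|^2\big).
\end{aligned}
\]

Applying (5), we then have for sufficiently small $h$,

\[ f(\theta_n) - f(\theta_k) + |f'(\theta_n)||\theta_k - \theta_n| \le 2C(1 + 1/(2h^2))C_0^2 n^{-2\tau} + \big(f(\theta_k) - f(\theta_n)\big), \]

which can be rewritten as

\[ 2\big(f(\theta_n) - f(\theta_k)\big) + |f'(\theta_n)||\theta_k - \theta_n| \le 2C(1 + 1/(2h^2))C_0^2 n^{-2\tau}, \]

which establishes (6).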

We next prove (7). If $|f'(\theta_n)| \le n^{-\tau'}$, applying (2), we deduce that

\[ |\theta_k - \theta_n| \le C C_0 n^{-\tau'} + C h^2 n^{-\tau'}. \tag{21} \]

It follows from (19) and (4) that

\[
\begin{aligned}
|f(\theta_k) - f(\theta_n)| &\le (A_k - A_n) f'^2(\theta_n) + |\tilde R_{n,k}| \\
&\le (A_k - A_n) f'^2(\theta_n) + C\big(C_0^2 n^{-2\tau} + C_0 n^{-\tau}|f'(\theta_n)| + h^2|f'(\theta_n)|^2\big) \\
&\le (A_k - A_n) f'^2(\theta_n) + C\big(C_0^2 n^{-2\tau'} + C_0 n^{-\tau'}|f'(\theta_n)| + h^2|f'(\theta_n)|^2\big) \\
&\le C C_0^2 n^{-2\tau'}(1 + 1/(2h^2)) + (h + 3Ch^2/2)|f'(\theta_n)|^2,
\end{aligned}
\]

which, together with (21), immediately implies that for some $C$,

\[
\begin{aligned}
|\theta_k - \theta_n| &\le n^{\tau'}\big(f(\theta_k) - f(\theta_n)\big) + n^{\tau'}|f(\theta_k) - f(\theta_n)| + C C_0 n^{-\tau'} + C h^2 n^{-\tau'} \\
&\le n^{\tau'}\big(f(\theta_k) - f(\theta_n)\big) + C C_0^2 n^{-\tau'}(1 + 1/(2h^2)) \\
&\quad + (h + 3Ch^2/2)\, n^{\tau'} |f'(\theta_n)|^2 + C C_0 n^{-\tau'} + C h^2 n^{-\tau'}.
\end{aligned}
\tag{22}
\]

On the other hand, if $|f'(\theta_n)| > n^{-\tau'}$, applying (6), we deduce that

\[
\begin{aligned}
|\theta_k - \theta_n| &\le 2|f'(\theta_n)|^{-1}\big(f(\theta_k) - f(\theta_n)\big) + (C + 1/(2h^2))|f'(\theta_n)|^{-1} C_0^2 n^{-2\tau} \\
&\le 2n^{\tau'}\big(f(\theta_k) - f(\theta_n)\big) + (C + 1/(2h^2)) C_0^2 n^{-\tau'}.
\end{aligned}
\tag{23}
\]

Combining (22) and (23), we have then established (7). □

7 Rate of Convergence with Concavity 

In this section, we assume that 

(7.a) $f(\theta)$ is strictly concave with respect to $\theta$. More precisely, there exists $\varepsilon > 0$ such that for any $\theta_1, \theta_2 \in \Theta$,

fl(t9i + (l-t)e 2 )>e, 

for all $0 \le t \le 1$.

(7.b) With probability 1, $\theta_n$ converges to the unique global maximum $\theta^*$ as $n$ tends to $\infty$.

Here, let us note that (7.a), together with Theorem 4.5, implies (5.a). With Assumptions (7.a) and (7.b), which, as argued in Section 9, can be satisfied for a class of finite-state channels, we will derive the convergence rate of $\{\theta_n\}$. Again, for notational convenience only, we assume that $\Theta = \mathbb{R}$ in the proofs.
From

\[ \theta_{n+1} - \theta_n = a_n f'(\theta_n) + a_n R_n(\theta_n), \]

trivially we have

\[ \Delta_{n+1} - \Delta_n = -a_n f'(\theta_n) - a_n R_n(\theta_n), \]

where

\[ \Delta_n = \theta^* - \theta_n. \]

It immediately follows from the above two conditions that for $\theta$ sufficiently close to $\theta^*$

\[ f(\theta) = \Theta(|\theta^* - \theta|^2), \qquad f'(\theta) = \Theta(|\theta^* - \theta|). \tag{24} \]

So, if $\theta_n$ is sufficiently close to $\theta^*$, we will have

\[ f(\theta_n) = O(\Delta_n^2), \qquad f'(\theta_n) = \Theta(|\Delta_n|). \]

Throughout the paper, by $M = O(N)$, we mean that there exists a positive random variable $C$ such that with probability 1,

\[ |M| \le CN. \]

In this section, we will prove that $\Delta_n$ is at most of order $O(n^{-\tau})$.
We first prove the following lemma.
Lemma 7.1. There exists $l \in \mathbb{N}$ such that

\[ \liminf_{n \to \infty} n^{\tau}|\Delta_n| \le l C_0. \]

Proof. Suppose, by way of contradiction, that for any $l$,

\[ n^{\tau}|\Delta_n| > l C_0, \tag{25} \]

as long as $n$ is sufficiently large. First, pick $n_0$ sufficiently large such that (25) is satisfied and then recursively define

\[ n_{k+1} = t(n_k, h) \]

for some $0 < h < 1$. We then have, for any feasible $k$,

\[ \theta_{n_{k+1}} = \theta_{n_k} + (A_{n_{k+1}} - A_{n_k}) f'(\theta_{n_k}) + R_{n_k, n_{k+1}}. \]

It then follows from Lemma 6.2 (3) and (24) that $R_{n_k, n_{k+1}}$ is dominated by $|f'(\theta_{n_k})|$ as long as $l$ is chosen sufficiently large and $h$ is chosen sufficiently small. Noticing that, due to the concavity of $f$, $\Delta_n$ always has the same sign as $f'(\theta_n)$, we then have

\[ |\Delta_{n_{k+1}}| \le |\Delta_{n_k}| - \tfrac{h}{2}|\Delta_{n_k}| \le |\Delta_{n_k}| e^{-h/2}, \]

an iterative application of which would yield

\[ |\Delta_{n_k}| \le |\Delta_{n_0}| e^{-kh/2}. \]

It then follows that for any $k$

\[ |\Delta_{n_0}|\, n_k^{\tau} e^{-kh/2} \ge n_k^{\tau}|\Delta_{n_k}| \ge l C_0. \]

This, together with the fact that (see (17))

\[ n_k = O(k^{1/(1-a)}), \]

as $k$ tends to infinity, implies that

\[ C_0 \le 0, \]

which is a contradiction. □
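The growth rate $n_k = O(k^{1/(1-a)})$ used at the end of this proof can be illustrated numerically. The sketch below assumes step sizes $a_i = i^{-a}$ and takes $t(n,h)$ to be the largest $k \ge n$ with $A_k - A_n \le h$; both are assumptions made here for illustration, since the definitions live outside this section, and the values of `a`, `h` and `n0` are arbitrary.

```python
def t(n, h, a):
    """Assumed definition: largest k >= n with sum_{i=n}^{k-1} i**(-a) <= h."""
    k, acc = n, 0.0
    while acc + k ** -a <= h:
        acc += k ** -a
        k += 1
    return k

a, h, n0 = 0.5, 0.5, 4      # illustrative values; n0 is chosen so that n0**(-a) <= h
n_seq = [n0]
for _ in range(200):
    n_seq.append(t(n_seq[-1], h, a))

# n_k / k^{1/(1-a)} should stay bounded if n_k = O(k^{1/(1-a)})
ratios = [n_seq[k] / k ** (1 / (1 - a)) for k in range(1, 201)]
print(f"largest ratio n_k / k^(1/(1-a)) over k <= 200: {max(ratios):.3f}")
```

With these values the ratio is largest at $k = 1$ and decreases afterwards, which is consistent with the polynomial growth that makes the factor $e^{-kh/2}$ dominate.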
Theorem 7.2.

\[ |\Delta_n| = O(n^{-\tau}). \]

Proof. It is enough to prove that there exists an integer $l$ such that for all $n$ sufficiently large,

\[ n^{\tau}|\Delta_n| \le l C_0. \]

By way of contradiction, suppose otherwise. Then, by Lemma 7.1, for any $l$ and arbitrarily large $N$, we can find $k_0 > m_0 > N$ such that

\[ m_0^{\tau}\Delta_{m_0} \le 2lC_0, \quad k_0^{\tau}\Delta_{k_0} \ge 3lC_0, \quad \min_{m_0 < n < k_0} n^{\tau}\Delta_n \ge 2lC_0, \quad \max_{m_0 < n < k_0} n^{\tau}\Delta_n \le 3lC_0. \tag{26} \]

Now, for some $0 < h < 1$, let $m_1 = t(m_0, h)$. Then, for any $m_0 < n \le m_1$, it follows from (25) and

\[ \theta_n - \theta_{m_0} = (A_n - A_{m_0}) f'(\theta_{m_0}) + R_{m_0,n}, \qquad |R_{m_0,n}| \le C\big(C_0 m_0^{-\tau} + (A_n - A_{m_0})|f'(\theta_{m_0})|\big) \]

that

\[ |\Delta_n - \Delta_{m_0}| = O(m_0^{-\tau}) C_0. \]

Applying (26), we then deduce that for sufficiently small $h$

\[
\begin{aligned}
|n^{\tau}\Delta_n - m_0^{\tau}\Delta_{m_0}| &\le n^{\tau}|\Delta_n - \Delta_{m_0}| + (n^{\tau} - m_0^{\tau})\Delta_{m_0} \\
&\le O(m_0^{\tau})\, O(m_0^{-\tau})\, C_0 + o(m_0^{\tau})\, 2l\, m_0^{-\tau} C_0,
\end{aligned}
\]

where we have used the fact that

\[ n^{\tau} = O(m_0^{\tau}), \qquad n^{\tau} - m_0^{\tau} = o(m_0^{\tau}). \]

It then follows that, with $l$ large enough and $h$ small enough, we have

\[ |n^{\tau}\Delta_n - m_0^{\tau}\Delta_{m_0}| \le l C_0; \]

in particular, we have

\[ |(m_0 + 1)^{\tau}\Delta_{m_0+1} - m_0^{\tau}\Delta_{m_0}| \le lC_0 \quad \text{and} \quad |m_1^{\tau}\Delta_{m_1} - m_0^{\tau}\Delta_{m_0}| \le lC_0, \]

which further implies that

\[ m_0^{\tau}\Delta_{m_0} \ge lC_0 \quad \text{and} \quad m_1 < k_0, \]

respectively.

Now, for some $0 < h < 1$, we have

\[ \theta_{m_1} - \theta_{m_0} = (A_{m_1} - A_{m_0}) f'(\theta_{m_0}) + R_{m_0,m_1}, \]

and

\[ |R_{m_0,m_1}| \le C\big(C_0 m_0^{-\tau} + (A_{m_1} - A_{m_0})|f'(\theta_{m_0})|\big). \]

As in the proof of Lemma 7.1, if $l$ is chosen large enough, then $|f'(\theta_{m_0})|$ will dominate $|R_{m_0,m_1}|$. Again, due to the concavity of $f$, $\Delta_{m_0}$ always has the same sign as $f'(\theta_{m_0})$; then for sufficiently small $h > 0$, we have

\[ |\Delta_{m_1}| \le |\Delta_{m_0}| - \tfrac{h}{2}|\Delta_{m_0}|. \]

Then, for $m_0$ sufficiently large such that

\[ m_1^{\tau} < m_0^{\tau}/(1 - h/2), \]

we have

\[ m_1^{\tau}|\Delta_{m_1}| \le m_1^{\tau}|\Delta_{m_0}|(1 - h/2) < m_0^{\tau}|\Delta_{m_0}| \le 2lC_0, \]

which is a contradiction to (26).

□
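As an informal illustration of the $O(n^{-\tau})$ rate, the following sketch runs the recursion $\theta_{n+1} = \theta_n + a_n(f'(\theta_n) + R_n)$ on a toy strictly concave objective. Everything specific here is an assumption made for the example: the quadratic $f$, the step sizes $a_n = n^{-a}$, and a per-step error bounded by $n^{-\tau}$ (the paper instead controls weighted sums of the errors through Lemma 6.1). The function name `run_sa` is hypothetical.

```python
import random

def run_sa(theta_star=1.0, a=0.7, tau=0.5, steps=20000, seed=0):
    """Toy run of theta_{n+1} = theta_n + a_n * (f'(theta_n) + R_n), f(theta) = -(theta - theta_star)**2 / 2."""
    rng = random.Random(seed)
    theta = 0.0
    scaled_errors = []
    for n in range(1, steps + 1):
        a_n = n ** -a
        grad = -(theta - theta_star)                  # f'(theta_n) for the toy objective
        noise = rng.uniform(-1.0, 1.0) * n ** -tau    # assumed error term with |R_n| <= n**(-tau)
        theta += a_n * (grad + noise)
        # record (n + 1)^tau * |Delta_{n+1}| with Delta = theta_star - theta
        scaled_errors.append((n + 1) ** tau * abs(theta_star - theta))
    return theta, scaled_errors

theta, scaled = run_sa()
print(f"final theta = {theta:.4f}, max n^tau |Delta_n| = {max(scaled):.3f}")
```

For this toy model one can check by induction that $n^{\tau}|\Delta_n| \le 2$ for all $n \ge 2$, so the scaled error stays bounded, in line with the statement of the theorem.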
8 Rate of Convergence without Concavity

In this section, assuming (5.a) and

(8.a) with probability 1, $\theta_n \in Q$ for all $n$, where $Q$ is a compact subset of $\Theta$,

we derive the rate of convergence of our algorithm. Again, for notational convenience only, we assume that $\Theta = \mathbb{R}$.

As one of the main results in real algebraic geometry, the Lojasiewicz inequality [9], among many other applications, has been widely applied to the convergence analysis of a broad class of dynamical systems. In this section, we will first use the "function" version of the Lojasiewicz inequality (Lemma 8.1) to prove that $\{f(\theta_n)\}$ converges almost surely and derive the convergence rate, which can be further used to derive the convergence rate of $\{\theta_n\}$. Then, using the "variable" version of the Lojasiewicz inequality (Lemma 8.7), the rate of convergence can be refined. The above-mentioned framework is essentially due to Tadić [49]; however, a comprehensive adaptation to our setting is carried out in this section.

Following [19], we state the "function" version of the Lojasiewicz inequality as below.

Lemma 8.1. For any compact set $Q \subset \Theta$ and real number $z \in f(Q)$, there exist real numbers $\delta_{Q,z} \in (0,1)$, $\mu_{Q,z} \in (1,2]$ and $M_{Q,z} \in [1,\infty)$ such that

\[ |f(\theta) - z| \le M_{Q,z}\, |f'(\theta)|^{\mu_{Q,z}} \]

for all $\theta \in Q$ satisfying $|f(\theta) - z| \le \delta_{Q,z}$.

Define

We first prove the following lemma.

Lemma 8.2. There exists a positive integer $l$ such that for all $n$ sufficiently large,

\[ n^{\mu\tau}\Delta_n \ge -l C_0^{\mu}. \]



Proof. Suppose, by way of contradiction, that for any $l$, there exists some $n_0$ such that

\[ n_0^{\mu\tau}\Delta_{n_0} < -l C_0^{\mu}. \tag{27} \]

Then, by Lemma 6.2 (5), we have, for some $0 < h < 1$,

\[ f(\theta_{n_0}) - f(\theta_{t(n_0,h)}) \le -(3/4 - 3Ch/2)h|f'(\theta_{n_0})|^2 + C C_0^2 n_0^{-2\tau}(1 + 1/(2h^2)), \]

which implies for $h$ sufficiently small,

\[
\begin{aligned}
\Delta_{t(n_0,h)} - \Delta_{n_0} &\le -(3/4 - 3Ch/2)h|f'(\theta_{n_0})|^2 + C C_0^2 n_0^{-2\tau}(1 + 1/(2h^2)) \\
&\le -\tfrac{h}{2}|f'(\theta_{n_0})|^2 + C C_0^2 n_0^{-2\tau}(1 + 1/(2h^2)).
\end{aligned}
\]

Choosing $l$ sufficiently large, then by Lemma 8.1 and (27), we deduce that for $n_0$ large enough,

\[ -\tfrac{h}{2}|f'(\theta_{n_0})|^2 + C C_0^2 n_0^{-2\tau}(1 + 1/(2h^2)) \le -\tfrac{h}{4}|f'(\theta_{n_0})|^2, \]

and therefore

\[ \Delta_{t(n_0,h)} - \Delta_{n_0} \le -\tfrac{h}{4}|f'(\theta_{n_0})|^2. \tag{28} \]

We then have

\[ \Delta_{t(n_0,h)} \le \Delta_{n_0} < -l C_0^{\mu} n_0^{-\mu\tau} < -l C_0^{\mu}\, t(n_0,h)^{-\mu\tau}. \]

Henceforth, recursively define

\[ n_{k+1} = t(n_k, h). \]

It then follows that for any $k$,

\[ \Delta_{n_k} \le \Delta_{n_0} < -l C_0^{\mu} n_0^{-\mu\tau} < 0, \]

which is a contradiction to the fact that almost surely

\[ \lim_{k \to \infty} \Delta_{n_k} = 0. \]

□

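In the remainder of this section, define

\[ \hat\tau = \min\big(\mu\tau,\ \mu(1 - a)/(2 - \mu)\big). \]

Lemma 8.3. There exists a positive integer $l$ such that

\[ \liminf_{n \to \infty} n^{\hat\tau}\Delta_n \le l C_0^{\mu} \]

almost surely.

Proof. Suppose, by way of contradiction, that for any $l$, we have

\[ n^{\hat\tau}\Delta_n > l C_0^{\mu}, \tag{29} \]

for all $n$ sufficiently large. By Lemma 6.2 (5), for any $0 < h < 1$, we have for $n_0$ large enough,

\[ f(\theta_{n_0}) - f(\theta_{t(n_0,h)}) \le -(3/4 - 3Ch/2)h|f'(\theta_{n_0})|^2 + C C_0^2 n_0^{-2\tau}(1 + 1/(2h^2)), \]

which implies that for $h$ sufficiently small

\[
\begin{aligned}
\Delta_{t(n_0,h)} - \Delta_{n_0} &\le -(3/4 - 3Ch/2)h|f'(\theta_{n_0})|^2 + C C_0^2 n_0^{-2\tau}(1 + 1/(2h^2)) \\
&\le -\tfrac{h}{2}|f'(\theta_{n_0})|^2 + C C_0^2 n_0^{-2\tau}(1 + 1/(2h^2)).
\end{aligned}
\]

Choosing $l$ sufficiently large, then by Lemma 8.1 and (29), we deduce that for $n_0$ sufficiently large

\[ -\tfrac{h}{2}|f'(\theta_{n_0})|^2 + C C_0^2 n_0^{-2\tau}(1 + 1/(2h^2)) \le -\tfrac{h}{4}|f'(\theta_{n_0})|^2, \]

and therefore

\[ \Delta_{t(n_0,h)} - \Delta_{n_0} \le -\tfrac{h}{4}|f'(\theta_{n_0})|^2. \tag{30} \]

Now, recursively define

\[ n_{k+1} = t(n_k, h). \]

An iterated application of (30), together with Lemma 8.1, yields for some constant $C_1$,

\[ \Delta_{n_{k+1}} - \Delta_{n_k} \le -C_1 h\, \Delta_{n_k}^{2/\mu}. \]

We then have two cases:

Case $\mu = 2$: For this case, we have

\[ \Delta_{n_{k+1}} \le (1 - C_1 h)\Delta_{n_k}. \]

Recursively, we deduce that

\[ \Delta_{n_k} \le \Delta_{n_0}(1 - C_1 h)^k \le \Delta_{n_0} e^{-C_1 h k}, \]

which implies that for any $k$,

\[ \Delta_{n_0}\, n_k^{\hat\tau} e^{-C_1 h k} \ge n_k^{\hat\tau}\Delta_{n_k} \ge l C_0^{\mu}. \]

This, however, will yield $C_0 \le 0$ when we take $k$ to $\infty$, which is a contradiction.

Case $\mu < 2$: For this case, it follows from

\[ \Delta_{n_{k+1}} - \Delta_{n_k} \le -C_1 h\, \Delta_{n_k}^{2/\mu} \]

that

\[ \int_{\Delta_{n_{k+1}}}^{\Delta_{n_k}} \frac{1}{u^{2/\mu}}\, du \ \ge\ \int_{\Delta_{n_{k+1}}}^{\Delta_{n_k}} \frac{1}{\Delta_{n_k}^{2/\mu}}\, du = \frac{\Delta_{n_k} - \Delta_{n_{k+1}}}{\Delta_{n_k}^{2/\mu}} \ \ge\ C_1 h, \]

which implies that for some positive constant $C_2$

\[ \Delta_{n_{k+1}}^{-2/\mu + 1} - \Delta_{n_k}^{-2/\mu + 1} \ge C_2 h. \]

Recursively, we deduce that

\[ \Delta_{n_k}^{-2/\mu + 1} \ge \Delta_{n_0}^{-2/\mu + 1} + C_2 h k, \]

and furthermore

\[ \Delta_{n_k}^{(-2+\mu)/\mu} \ge C_2 h k. \]

It then follows from (29) and (17) that

\[ (l C_0^{\mu})^{(-2+\mu)/\mu} \ \ge\ n_k^{\hat\tau(-2+\mu)/\mu}\, \Delta_{n_k}^{(-2+\mu)/\mu} \ \ge\ \Theta\big(k\, n_k^{\hat\tau(-2+\mu)/\mu}\big) \ \ge\ \Theta\big(k^{\hat\tau(-2+\mu)/((-a+1)\mu) + 1}\big). \]

Now, one verifies that this gives us a contradiction if we take $k, l$ to $\infty$, as long as $\hat\tau \le \mu(1 - a)/(2 - \mu)$, equivalently, $\hat\tau(-2+\mu)/((-a+1)\mu) + 1 \ge 0$.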
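The integral comparison in the case $\mu < 2$ can be sanity-checked on the extremal recursion $\Delta_{k+1} = \Delta_k - C_1 h\, \Delta_k^{2/\mu}$. The sketch below uses illustrative values ($\mu = 1.5$, $C_1 h = 0.1$, $\Delta_0 = 1$), none of which come from the paper, and takes $C_2 = (2/\mu - 1)C_1$, which is what the integral of $u^{-2/\mu}$ gives.

```python
mu, c1h = 1.5, 0.1            # illustrative values with 1 < mu < 2
p = 2.0 / mu - 1.0            # exponent 2/mu - 1, here 1/3
c2h = p * c1h                 # C_2 * h, with the assumed choice C_2 = (2/mu - 1) * C_1

delta = 1.0
history = [delta]
for _ in range(5000):
    delta -= c1h * delta ** (2.0 / mu)   # equality case of the recursion
    history.append(delta)

# check Delta_k^{-(2/mu - 1)} >= Delta_0^{-(2/mu - 1)} + C_2 h k at every step
ok = all(d ** -p >= history[0] ** -p + c2h * k - 1e-9 for k, d in enumerate(history))
print(ok, history[-1])
```

The resulting decay $\Delta_k = O(k^{-\mu/(2-\mu)})$ is the source of the second term in the definition of $\hat\tau$ once $k$ is converted to $n_k$.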
Lemma 8.4. There exists an integer $l$ such that for all $n$ sufficiently large,

\[ n^{\hat\tau}\Delta_n \le l C_0^{\mu}. \]

Proof. By way of contradiction, suppose otherwise. Then, by Lemma 8.3, for any $l$ and arbitrarily large $N$, we can find $k_0 > m_0 > N$ such that

\[ m_0^{\hat\tau}\Delta_{m_0} \le 2lC_0^{\mu}, \quad k_0^{\hat\tau}\Delta_{k_0} \ge 3lC_0^{\mu}, \quad \min_{m_0 < n < k_0} n^{\hat\tau}\Delta_n \ge 2lC_0^{\mu}, \quad \max_{m_0 < n < k_0} n^{\hat\tau}\Delta_n \le 3lC_0^{\mu}. \tag{31} \]

For some $0 < h < 1$, let $m_1 = t(m_0, h)$. For any $m_0 < n \le m_1$, as in the proof of Lemma 8.2, we derive

\[ \Delta_n - \Delta_{m_0} \le -\tfrac{h}{4}|f'(\theta_{m_0})|^2, \tag{32} \]
which, together with Theorem 18.21 and (13"Tj) . implies that 

/' (# mo ) 2 < 4M 2 o(v) + cgo« T ). 

which, together with (fT§|) . further implies that for some C > 0, 

|A n - A mo | < Ch\f'(e mo )\ 2 + CC 2 m- 2r < C 2 O(m +) + C 2 0{m^ T ) + CC^m^. 

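It then follows that for sufficiently small $h$

\[
\begin{aligned}
|n^{\hat\tau}\Delta_n - m_0^{\hat\tau}\Delta_{m_0}| &\le n^{\hat\tau}|\Delta_n - \Delta_{m_0}| + (n^{\hat\tau} - m_0^{\hat\tau})\Delta_{m_0} \\
&= O(m_0^{\hat\tau})(\Delta_n - \Delta_{m_0}) + o(m_0^{\hat\tau})\Delta_{m_0} \\
&\le l C_0^{\mu},
\end{aligned}
\]

where we have used the fact that

\[ n^{\hat\tau} = O(m_0^{\hat\tau}), \qquad n^{\hat\tau} - m_0^{\hat\tau} = o(m_0^{\hat\tau}). \]

In particular, we have

\[ |(m_0 + 1)^{\hat\tau}\Delta_{m_0+1} - m_0^{\hat\tau}\Delta_{m_0}| \le lC_0^{\mu} \quad \text{and} \quad |m_1^{\hat\tau}\Delta_{m_1} - m_0^{\hat\tau}\Delta_{m_0}| \le lC_0^{\mu}, \]

which further implies that

\[ m_0^{\hat\tau}\Delta_{m_0} \ge lC_0^{\mu} \quad \text{and} \quad m_1 < k_0, \]

respectively.

Setting $n = m_1$ and rewriting (32), we have for some constant $C_1$,

\[ \Delta_{m_1} - \Delta_{m_0} \le -C_1 h\, \Delta_{m_0}^{2/\mu}. \]

We then consider two cases:

Case $\mu = 2$: For this case, we have for some positive constant $C_1$,

\[ \Delta_{m_1} \le (1 - C_1 h)\Delta_{m_0}. \]

Then for $m_0$ large enough,

\[ m_1^{\hat\tau}\Delta_{m_1} \le (1 - C_1 h)\, m_1^{\hat\tau}\Delta_{m_0} = (1 - C_1 h)\, m_0^{\hat\tau}(1 + o(1))\Delta_{m_0} \le 2lC_0^2, \]

which yields a contradiction.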



Case /i < 2: For this case, as in the proof of Lemma I8.3[ we have for some positive 
constant C*2, 

It then follows from ( 13~TT) and ( TT71) that for Z sufficiently large 

A^^" > (2/C 2 ) ( - 2+ ^ /M m a+1 + C 2 /i > ^ZC^^^m^ 1 , 
which implies that 

m[A mi < 2lCl 

a contradiction. 

□ 

The following theorem characterizes the rate of convergence of $\{f(\theta_n)\}$.

Theorem 8.5. With probability 1, we have

\[ |\Delta_n| = O(n^{-\hat\tau}). \]

Proof. It immediately follows from Lemmas 8.3 and 8.4. □

In the rest of this section, assuming

(8.b) $\mu\tau > (1 - a)$,

we prove that $\{\theta_n\}$ converges almost surely. Here, let us note that (8.b) can always be satisfied if $a, b, \beta$ are appropriately chosen such that $\tau$ is sufficiently large.

The following theorem characterizes the rate of convergence of $\{\theta_n\}$.

Theorem 8.6. Assume (8.b). Then, we have

\[ \sup_{k \ge n} |\theta_k - \theta_n| = O\big(n^{-(\hat\tau - (1-a))/2}\big). \]

Proof. In this proof, we set

\[ \tau' = (\hat\tau + (1 - a))/2. \]

For some $0 < h < 1$, starting from a fixed $n_0$, recursively define

\[ n_{k+1} = t(n_k, h). \]

Then, to prove the theorem, it suffices to prove that

\[ \sup_{k \ge m} |\theta_{n_k} - \theta_{n_m}| = O\big(n_m^{(-\hat\tau + (1-a))/2}\big). \tag{33} \]

Now, applying Lemma 6.2 (7), we deduce that for some $C > 0$

\[ |\theta_{n_{i+1}} - \theta_{n_i}| \le C n_i^{\tau'}\big(f(\theta_{n_{i+1}}) - f(\theta_{n_i})\big) + C C_0^2 n_i^{-\tau'}. \]



It then follows that for any m < k, 



k-l 



i=m 

k-l k-l 

i=m i=m 
k-l k 

<CC 2 J2nr' + C {< -nti)He ni )\+Cn^\u{e nm )\+Cn]:\u{ 

i=m i=m+l 

Applying ( fl7l) . we deduce that 

k—l k—1 



'n k/ 



m ,. 



%=m i=m 
k 



«' " nti)H9 ni )\ = J2 0{(i - i)v/(i-)-i,-*/(i-)) = 0(m »/(i-.)-f/(i-) )j 

i=m+l i=m 

n T Ju(9 n J\ = 0{m r ''^m-^^) = Oim^'-^l^), 
ni\u(6 nh )\ = 0(k T '^- a) k- f /^) = 0(k^'~ f) ^- a) ). 

We then immediately conclude that 

\°n k u n m \ ^\ n m )■> 

which immediately implies (|33|) . 

□ 

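The following "variable" version of the Lojasiewicz inequality will be used to refine the rates of convergence of $\{\theta_n\}$ and $\{f(\theta_n)\}$.

Lemma 8.7. For each $\theta \in \Theta$, there exist real numbers $\delta_\theta \in (0,1)$, $\mu_\theta \in (1,2]$, $M_\theta \in [1,\infty)$ such that

\[ |f(\theta') - f(\theta)| \le M_\theta \|f'(\theta')\|^{\mu_\theta} \]

for all $\theta' \in \Theta$ satisfying $\|\theta' - \theta\| \le \delta_\theta$.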
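To make the role of the exponent concrete, the following sketch checks the inequality of Lemma 8.7 on a toy function. The objective $f(\theta) = -\theta^4$, the base point $0$, and the constants `delta`, `mu`, `M` are all chosen here for illustration and do not come from the paper. For this $f$ the inequality holds with exponent $4/3$ but fails with exponent $2$, which shows why a degenerate maximum forces an exponent strictly below 2.

```python
def f(theta):
    return -theta ** 4            # hypothetical objective with a degenerate maximum at 0

def f_prime(theta):
    return -4.0 * theta ** 3

theta_hat, delta, mu, M = 0.0, 0.5, 4.0 / 3.0, 1.0   # candidate constants for this f only

# grid over the neighbourhood |theta' - theta_hat| <= delta
grid = [theta_hat - delta + 2 * delta * i / 1000 for i in range(1001)]

holds = all(abs(f(t) - f(theta_hat)) <= M * abs(f_prime(t)) ** mu + 1e-15 for t in grid)
fails_with_mu_2 = any(abs(f(t) - f(theta_hat)) > M * abs(f_prime(t)) ** 2 for t in grid)
print(holds, fails_with_mu_2)
```

Here $|f'(\theta')|^{4/3} = 4^{4/3}\theta'^4 \ge \theta'^4$, whereas $|f'(\theta')|^2 = 16\theta'^6 < \theta'^4$ for $|\theta'| < 1/4$.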

Theorem 8.6 implies that with probability 1, $\{\theta_n\}$ converges. From now on, let $\hat\theta = \lim_{n \to \infty} \theta_n$ and set $\mu = \mu_{\hat\theta}$. Then, with this redefined $\mu$, going through exactly the same arguments as in the proofs of Theorems 8.5 and 8.6, we have the following two theorems.

Theorem 8.8. For the above redefined $\mu$, Theorem 8.5 holds.

Theorem 8.9. For the above redefined $\mu$, assume (8.b). Then, we have

\[ |\theta_n - \hat\theta| = O\big(n^{-(\hat\tau - (1-a))/2}\big). \]
9 Capacity Achieving Distribution of a Special Class of Channels

In this section, we restrict our attention to a special class of input-restricted finite-state channels with a certain parameterization, and we prove that for such channels operated in the high SNR regime, the capacity will only be achieved in the interior of the parameter space and our algorithm converges almost surely.

More specifically, recalling that $X, Y$ denote the input and output processes of the channel over finite alphabets $\mathcal{X}, \mathcal{Y}$, respectively, we consider a class of parameterized memoryless channels such that

(9.a) the channel only has one state; in other words, at any time slot, the channel is characterized by the conditional probability $p(y|x)$.

(9.b) for some mixing finite-type constraint $F \subset \mathcal{X}^2$, $X \in \Pi_F$.

(9.c) the channel is parameterized by $\varepsilon \ge 0$ such that for each $x$ and $y$, $p(y|x)(\varepsilon)$ is an analytic function of $\varepsilon \ge 0$, which is not identically 0.

(9.d) there is a one-to-one (not necessarily onto) mapping $\Phi : \mathcal{X} \to \mathcal{Y}$, such that for any $x \in \mathcal{X}$, $p(\Phi(x)|x)(0) = 1$.

(9.e) $X$ is parameterized as in [53], that is,

\[ \theta = \big(p(X_1 = w_1, X_2 = w_2) : (w_1, w_2) \in F\big). \]

Under the above assumptions, $\varepsilon$ can be regarded as a parameter that quantifies noise, and $\Phi(x)$ is the noiseless output corresponding to input $x$. The regime of "small $\varepsilon$" corresponds to high SNR. Note that the output process $Y = Y(X, \varepsilon)$ depends on the input process $X$ and the parameter value $\varepsilon$; we will often suppress the notational dependence on $\varepsilon$ or $X$ when it is clear from the context. Prominent examples of such families include input-restricted versions of the binary symmetric channel with crossover probability $\varepsilon$, denoted by BSC($\varepsilon$), and the binary erasure channel with erasure rate $\varepsilon$, denoted by BEC($\varepsilon$).

General SNR regime. By using an asymptotic formula of $I(X; Y)$, we show that for the above-mentioned channels, the capacity achieving $X$ must be primitive.

Assume that $X$ has period $e$ with period classes $D_1, D_2, \ldots, D_e$. Then, by the classical Perron-Frobenius theory, after necessary reindexing, its transition probability matrix $\Pi$ can be written as

\[
\Pi \;=\;
\begin{array}{c|ccccc}
 & D_1 & D_2 & D_3 & \cdots & D_e \\ \hline
D_1 & 0 & B_1 & 0 & \cdots & 0 \\
D_2 & 0 & 0 & B_2 & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
D_{e-1} & 0 & 0 & 0 & \cdots & B_{e-1} \\
D_e & B_e & 0 & 0 & \cdots & 0
\end{array}
\tag{34}
\]

where we used the period classes to index the sub-blocks. In the following, let $B$ denote the set of all entry indices of $\Pi$ corresponding to some $B_k$, that is,

\[ B = \{(i,j) : i \in D_k, j \in D_{k+1}, \text{ for } k = 1, \ldots, e-1\} \cup \{(i,j) : i \in D_e, j \in D_1\}. \]

Now, consider an analytic perturbation $\Pi(\delta)$ of $\Pi$, $\delta \ge 0$, where

(9.f) $\Pi(0) = \Pi$;

(9.g) for some $(i,j) \notin B$, $\Pi_{ij}(\delta)$ is not identically 0;

(9.h) for any $\delta \ge 0$, $\Pi(\delta)$ is still a stochastic matrix.

In other words, some non-$B$-entries in $\Pi$ are analytically perturbed; as a result, $Y$ is perturbed from $Y(0)$ to $Y(\delta)$. The following theorem describes the asymptotic behavior of $H(Y)$ under such a perturbation.
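A minimal sketch of such a perturbation, under assumptions chosen here for illustration: a two-state chain of period 2 whose $(0,0)$ entry, which lies outside the block pattern $B$, is raised to $\delta$. The helper names `mat_mul`, `is_primitive` and `perturbed` are hypothetical. The check confirms that the perturbed matrix stays stochastic and becomes primitive, while the unperturbed one is not.

```python
def mat_mul(P, Q):
    n = len(P)
    return [[sum(P[i][k] * Q[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def is_primitive(P, max_power=None):
    """Return True if some power of the nonnegative square matrix P is entrywise positive."""
    n = len(P)
    max_power = max_power or (n - 1) ** 2 + 1     # Wielandt's bound on the primitivity exponent
    power = P
    for _ in range(max_power):
        if all(entry > 0 for row in power for entry in row):
            return True
        power = mat_mul(power, P)
    return False

def perturbed(delta):
    # Period-2 chain at delta = 0; the (0, 0) entry is outside B and becomes delta.
    return [[delta, 1.0 - delta], [1.0, 0.0]]

print(is_primitive(perturbed(0.0)), is_primitive(perturbed(0.01)))
```

Any $\delta > 0$ breaks the periodicity in this toy case, which is the situation in which the entropy comparison below applies.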

Theorem 9.1. Under the above-mentioned perturbation as in (9.f)-(9.h), there exist $C_1, C_2 > 0$ such that

\[ C_1 \delta \log(1/\delta) \le H(Y(\delta)) - H(Y(0)) \le C_2 \delta^{1/2}. \]

Proof. The proof is postponed to Appendix A. □

Remark 9.2. It follows from Condition (9.a) that $H(Y|X)$ is linear with respect to $p$. Theorem 9.1, together with this fact, implies that there exist $C_1, C_2$ such that

\[ C_1 \delta \log(1/\delta) \le I(X(\delta); Y(\delta)) - I(X(0); Y(0)) \le C_2 \delta^{1/2}, \]

which implies that, for any irreducible but not primitive $X$, any perturbation of $X$ as in (9.f)-(9.h) will strictly increase the mutual information. So, we conclude that the capacity achieving $X$ must be primitive, and thus Condition (??) holds.

High SNR regime. At the high SNR regime, that is, when e is close to 0, it has been 
established in [24] that there exists e > such that 

(9.i) I(X; Y), when restricted on X G Il^g, is strictly concave with respect to 9 G ©. 
(9.j) the capacity of the channel can be uniquely achieved within U-F,e- 
As a consequence, we have the following theorem. 

Theorem 9.3. For the channel as in (9.a)-(9.d) operating at the high SNR regime and 
sufficiently small e, under the iteration in (Q]) ; {9 n } converges to the capacity achieving 
distribution with probability 1. 

Proof. Note that Condition (9.a) and Theorem 5.2 imply Conditions (7.a) and (7.b); and Condition (9.b) implies that the global maximum $\theta^*$ indeed corresponds to the capacity achieving distribution. The theorem then immediately follows. □
Example 9.4. Consider a binary symmetric channel with crossover probability $\varepsilon > 0$. Let $X$ be a binary input Markov chain with the transition probability matrix

\[ \Pi = \begin{pmatrix} 1 - \pi & \pi \\ 1 & 0 \end{pmatrix}, \tag{35} \]

where $0 < \pi < 1$. Apparently, $X$ is supported on the so-called $(1,\infty)$-RLL constraint [3], which simply means that the string "11" is forbidden. Let $Y$ denote the corresponding output process. Assume that $X$ is parameterized by $\theta = (p(00), p(01), p(10))$, where $p(10) = 1$ is in fact a constant. It can be checked that Conditions (9.a)-(9.d) are all satisfied, so when $\varepsilon$ is sufficiently small, Conditions (9.i)-(9.j) are satisfied and thus Theorem 9.3 holds.

On the other hand, it has been shown that for the output process $Y$, as $\varepsilon \to 0$,

\[ H(Y) = H(X) + \frac{\pi(2 - \pi)}{1 + \pi}\, \varepsilon \log(1/\varepsilon) + O(\varepsilon), \tag{36} \]

where the $O(\varepsilon)$-term is analytic with respect to $p$ (see Theorem 2.18 of [21]). It then follows that

\[ H(X|Y) = H(X) + H(Y|X) - H(Y) = H(\varepsilon) - \frac{\pi(2 - \pi)}{1 + \pi}\, \varepsilon \log(1/\varepsilon) + O(\varepsilon), \]

where $H(\varepsilon) = \varepsilon \log(1/\varepsilon) + (1 - \varepsilon)\log(1/(1 - \varepsilon))$. One can readily verify that $-\pi(2 - \pi)/(1 + \pi)$ is strictly convex with respect to $\theta$, which implies the strict convexity (rather than concavity) of $H(X|Y)$ when $\varepsilon$ is small enough. So, the concavity conjecture in [53] is not true in general, and thus the conditions guaranteeing the convergence of the GBAA are not satisfied.
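As a small numerical companion to the input chain of Example 9.4, the sketch below computes the entropy rate $H(X)$ of the two-state chain with rows $(1-\pi, \pi)$ and $(1, 0)$. It uses the standard closed form $H(X) = H_b(\pi)/(1+\pi)$, which follows from the stationary distribution $(1/(1+\pi), \pi/(1+\pi))$. The function names are hypothetical, and the code covers only the noiseless term $H(X)$, not the $\varepsilon\log(1/\varepsilon)$ correction in (36).

```python
import math

def h2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def input_entropy_rate(pi):
    """Entropy rate (bits) of the chain with transition matrix [[1 - pi, pi], [1, 0]]."""
    stationary_zero = 1.0 / (1.0 + pi)     # stationary probability of state 0
    return stationary_zero * h2(pi)        # state 1 moves deterministically and adds no entropy

golden = (1 + math.sqrt(5)) / 2
pi_star = (3 - math.sqrt(5)) / 2           # equals 1 / golden**2

grid = [i / 1000 for i in range(1, 1000)]
best_on_grid = max(input_entropy_rate(p) for p in grid)
print(input_entropy_rate(pi_star), math.log2(golden), best_on_grid)
```

The maximum over $\pi$ is $\log_2$ of the golden ratio, the well-known noiseless capacity of the $(1,\infty)$-RLL constraint, attained at $\pi = (3 - \sqrt{5})/2$.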



Appendices 



A Proof of Theorem 9.1

First of all, we define

\[
Z(\delta) = Z(X_1^n(\delta)) =
\begin{cases}
0 & (X_i(\delta), X_{i+1}(\delta)) \in B \text{ for all } i \in \{1, \ldots, n-1\}, \\
1 & (X_i(\delta), X_{i+1}(\delta)) \notin B \text{ for exactly one } i \in \{1, \ldots, n-1\}, \\
2 & (X_i(\delta), X_{i+1}(\delta)) \notin B \text{ for more than one } i \in \{1, \ldots, n-1\}.
\end{cases}
\]

Next, applying the Birch bound [10], we derive the following key inequality for this proof:

\[ \frac{H(Y_{m+1}^n(\delta) \mid Y_1^m(\delta), X_0(\delta), Z(\delta))}{n - m} \le H(Y(\delta)) \le \frac{H(Y_1^n(\delta) \mid X_0(\delta), Z(\delta))}{n} + \frac{H(Z(\delta))}{n} + \frac{H(X_0(\delta))}{n} \tag{37} \]

for any $m < n$.

The lower bound part. We first prove that there exists $C_1 > 0$ such that

\[ H(Y(\delta)) \ge H(Y(0)) + C_1 \delta \log(1/\delta), \]

which immediately implies the lower bound part of the theorem. In this part, we set

n = \/\og 5 and m = n/2. (38) 

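By definition, we have

\[
\begin{aligned}
\frac{H(Y_{m+1}^n(\delta) \mid Y_1^m(\delta), X_0(\delta), Z(\delta))}{n - m}
&= \sum_{x_0} p^{\delta}(x_0, Z = 0)\, \frac{H(Y_{m+1}^n(\delta) \mid Y_1^m(\delta), X_0(\delta) = x_0, Z(\delta) = 0)}{n - m} \\
&\quad + \sum_{x_0} p^{\delta}(x_0, Z = 1)\, \frac{H(Y_{m+1}^n(\delta) \mid Y_1^m(\delta), X_0(\delta) = x_0, Z(\delta) = 1)}{n - m} \\
&\quad + \sum_{x_0} p^{\delta}(x_0, Z = 2)\, \frac{H(Y_{m+1}^n(\delta) \mid Y_1^m(\delta), X_0(\delta) = x_0, Z(\delta) = 2)}{n - m} \\
&= T_1 + T_2 + T_3,
\end{aligned}
\]

where $p^{\delta}(x_0, Z = 0)$ means $P(X_0(\delta) = x_0, Z(\delta) = 0)$.

We next give estimates for each of the three terms defined above.

For $T_3$, notice that $n\delta < 1$ for sufficiently small $\delta$ and then

\[ \sum_{x_0} p^{\delta}(x_0, Z = 2) \le n^2 (C_0\delta)^2 + n^3 (C_0\delta)^3 + \cdots \le \frac{C_0^2}{1 - C_0 n\delta}\, n^2\delta^2, \]

for some $C_0 > 0$. It then follows that

\[
\begin{aligned}
T_3 &= \sum_{x_0} p^{\delta}(x_0, Z = 2)\, \frac{H(Y_{m+1}^n(\delta) \mid Y_1^m(\delta), X_0(\delta) = x_0, Z(\delta) = 2)}{n - m} \\
&\le \sum_{x_0} p^{\delta}(x_0, Z = 2)\, \frac{H(Y_{m+1}^n(\delta))}{n - m} \\
&\le \sum_{x_0} p^{\delta}(x_0, Z = 2) \log|\mathcal{Y}| \\
&= O(n^2\delta^2).
\end{aligned}
\tag{39}
\]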

For T 2 , one verifies that for any xo, there exist constants Ci,C% > 0, < Ai < A2 < 1 
such that 

CitwJA? < A^o, Z = 1) < C 2 n5A™. 

Similarly, for any Xo, there exist C3, C4 > 0, and the same < Ai < A2 < 1 as above such 
that 

C 3 mA™ < p\y?\x Q , Z = 1) < C 4 mA™. 
It then follows that for any xq, 

C 5 5\ n 2 /X? <p 5 (y^ +l \y?,X ,Z = 1) < C^/Af, 

which, together with ( 1381) . implies that 

^(l^ +1 (<5)|FrW^o(5),Z((5) = 1) = 0(log 1/5) + 0(n log A 2 ) + 0(m log Ai). 
This, together with the fact

\[ p(x_0, Z = 1) = \Theta(n\delta), \]

implies that

\[ T_2 = \Theta(\delta \log(1/\delta)) + O(n\delta \log \lambda_2) + O(m\delta \log \lambda_1). \tag{40} \]
For Ti, notice that it can be rewritten as 

Ti = £y x °> z = °) l °gp s (y™+i\y?> x <>> z = °)/(" - ™)- 

One then verifies that 

lAvi, *o, Z = 0)- p°(y?, x Q , Z = 0)U =0 | = O(n5)p (y?, x , Z = 0), 
which implies that 

EAvx,x ,Z = 0)\ogp s (yl +1 \y?,x ,Z = 0) J2p s (y?,x ,Z = 0) log p 5 (y^ +1 \yT, x Q , Z = 0) 



n — m n — m 

When fixing xq and assuming Z = 0, the analyticity argument in [2U] can be used to prove 
that 

5>V, *o, ^ = 0) log/^^ ^0, Z = 0)/(n - m) 
exponentially converges to an analytic function of 5. It then follows that for some < p < 1 

T\ = H(Y(0)) + 0(p m ) + 0(5). (41) 
Combining (HOI) and (Hill , we then have 

F(y- +1 (5)|yr, *<>(*), Z(5))/(n - m) = H(Y(0)) + 6(5 log 1/5). 
The upper bound part. We then prove that there exists $C_2 > 0$ such that

\[ H(Y(\delta)) \le H(Y(0)) + C_2 \delta^{1/2}, \]

which immediately implies the upper bound part of the theorem. For this part, we set

\[ n = \delta^{-1/2} \quad \text{and} \quad m = 0. \tag{42} \]

Using an argument parallel to that in the lower bound part, we can still derive (??), (??) and (??) and then

\[ H(Y_1^n \mid X_0(\delta), Z(\delta))/n = H(Y)\big|_{\delta = 0} + O(\delta^{1/2}). \]

It can be verified that

\[ p(Z(\delta) = 1) = O(n\delta), \qquad p(Z(\delta) = 2) = O(n^2\delta^2), \]

which, together with the straightforward fact $H(X_0(\delta))/n = O(1/n)$, implies that

2 

H(Z(5)) = -J2p( Z ( 5 ) = i) log p{Z (6) = i) = O (n5 log 5) + 0(n5 log n), 



0(n5). 



i=Q 

and consequently 



H(Z(5)) =0 ^ log ^ + ^ logn y 



n 

The upper bound part then follows from all the above estimates and (37).
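The size of the $H(Z(\delta))/n$ term under the choice $n = \delta^{-1/2}$ can be illustrated with a simplified model. The sketch below assumes, purely for illustration, that each of the $n-1$ transitions leaves $B$ independently with probability $\delta$; the actual process is Markov, so this is only a rough stand-in. The function name `entropy_of_z` is hypothetical.

```python
import math

def entropy_of_z(n, delta):
    """Entropy (nats) of Z in {0, 1, 2} under the assumed independent model."""
    trials = n - 1
    p0 = (1 - delta) ** trials                            # no transition leaves B
    p1 = trials * delta * (1 - delta) ** (trials - 1)     # exactly one transition leaves B
    p2 = max(0.0, 1.0 - p0 - p1)                          # more than one transition leaves B
    return -sum(p * math.log(p) for p in (p0, p1, p2) if p > 0)

for delta in (1e-2, 1e-4, 1e-6):
    n = round(delta ** -0.5)
    print(delta, entropy_of_z(n, delta) / n, math.sqrt(delta))
```

In this toy model $H(Z(\delta))/n$ is of order $\delta\log(1/\delta)$, well below $\delta^{1/2}$, so it does not affect the $O(\delta^{1/2})$ upper bound.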